6 May 2015

Outline

  • In this session we will learn a number of approaches for inferring population structure among population samples
  • Genetics-only approaches
    • PCA
    • K-means
    • Hierarchical clustering
  • Spatially explicit approaches
    • sPCA
    • MEMGENE (new!)
  • Light on theory, heavy on practice!
    • For a more detailed theoretical exploration of these analyses, see the adegenet website

Multivariate analysis

  • Also known as:
    • Dimension reduction
    • Ordination methods
    • Factorial methods
  • Designed to reduce complexity of multiple variables to:
    • summarise diversity among individuals
    • summarise co-variation between variables
  • Principal components are synthetic variables that summarise most of the variation in the data (see the toy sketch below)
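  • As a toy illustration (using base R's prcomp and made-up variables, rather than the adegenet workflow we use below), a few correlated variables can be summarised by one or two principal components:

    # toy example: 100 individuals scored for 3 variables, two of them correlated
    set.seed(42)
    base_var <- rnorm(100)
    toy <- data.frame(v1 = base_var + rnorm(100, sd = 0.3),
                      v2 = base_var + rnorm(100, sd = 0.3),
                      v3 = rnorm(100))
    toy_pca <- prcomp(toy, scale. = TRUE)
    # proportion of the total variance captured by each PC
    summary(toy_pca)$importance["Proportion of Variance", ]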

Analysis workflow

  1. Do a PCA on genetic data to reduce data complexity
  2. Select principal components which explain most variation
  3. Use selected PCs to cluster individuals
    • K-means
    • Hierarchical methods
  4. Group individuals into discrete groups

The adegenet package

  • For many of these multivariate analyses we will be using the excellent adegenet package.

    # install the package
    install.packages("adegenet")
    # get a help tutorial
    library(adegenet)
    adegenetTutorial("basics")

Example

Two population model

  • Simple evolutionary history

Read a genepop file with adegenet

  • The adegenet package can be used to read genepop data into R; the first lines of our genepop file look like this:
../resources/fsc_linux64/2pop_test/2pop_test_1_1_gen_converted
locus1
locus2
locus3
locus4
locus5
POP
pop1_01 ,   500500  500499  501501  500500  500501
pop1_02 ,   501500  500498  501501  500500  500500
pop1_03 ,   500500  499498  501501  501500  500500
pop1_04 ,   500501  498499  501500  500500  500500
pop1_05 ,   500499  498500  501500  500500  500500
pop1_06 ,   501500  498500  501501  499500  500500
pop1_07 ,   499500  499499  501501  500500  502499
pop1_08 ,   500500  500500  501500  500500  501502
pop1_09 ,   499500  500499  501501  500500  500500
pop1_10 ,   500501  499500  501501  500500  500500
POP
pop2_01 ,   500500  499500  499499  501500  499499
pop2_02 ,   500500  499500  500500  498500  499499
...

Read a genepop file with adegenet

pops <- read.genepop(file = "2pops.gen")
pops

   #####################
   ### Genind object ### 
   #####################
- genotypes of individuals - 

S4 class:  genind
@call: read.genepop(file = "2pops.gen")

@tab:  20 x 18 matrix of genotypes

@ind.names: vector of  20 individual names
@loc.names: vector of  5 locus names
@loc.nall: number of alleles per locus
@loc.fac: locus factor for the  18 columns of @tab
@all.names: list of  5 components yielding allele names for each locus
@ploidy:  2
@type:  codom

Optional contents: 
@pop:  factor giving the population of each individual
@pop.names:  factor giving the name of each population

@other: - empty -

Data structure

  • The pops variable is a genind S4 object
    • Uses slots to store separate pieces of information
    pops@loc.names
          L1       L2       L3       L4       L5 
    "locus1" "locus2" "locus3" "locus4" "locus5" 
  • Actual genotypes are stored in the tab slot
pops@tab[1:5, 1:5]
   L1.1 L1.2 L1.3 L2.1 L2.2
01  0.0  1.0  0.0  0.0  0.5
02  0.0  0.5  0.5  0.5  0.0
03  0.0  1.0  0.0  0.5  0.5
04  0.0  0.5  0.5  0.5  0.5
05  0.5  0.5  0.0  0.5  0.0

Making sure our data are ready

  • How you deal with missing data when using multivariate methods is important
    • The most common approach is to replace missing values with the mean allele frequency
    • This does not inflate the variation in the PCA
  • Allele frequencies in @tab should also be scaled (i.e. normalised)
    • Both of these tasks can be done using the function scaleGen

      scaled_pops <- scaleGen(pops, missing = "mean")
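  • A minimal sketch of what this does under the hood (mean imputation, then centring each allele column); scaleGen handles the details, including optional scaling, for us:

    freqs <- pops@tab                       # individual x allele frequency matrix
    # replace missing values with that allele's mean frequency
    freqs <- apply(freqs, 2, function(a){
      a[is.na(a)] <- mean(a, na.rm = TRUE)
      a
    })
    # centre each allele column (scaleGen can also scale them)
    scaled_by_hand <- scale(freqs, center = TRUE, scale = FALSE)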

Running a PCA

  • Now we can run our PCA using the function dudi.pca.

    pca_res <- dudi.pca(df = scaled_pops, scale = FALSE, center = FALSE,
                        scannf = FALSE, nf = 2)

Visualising PCA results

  • The PCA results look like this
pca_res
Duality diagramm
class: pca dudi
$call: dudi.pca(df = scaled_pops, center = FALSE, scale = FALSE, scannf = FALSE, 
    nf = 2)

$nf: 2 axis-components saved
$rank: 13
eigen values: 3.816 1.493 1.092 0.9362 0.6944 ...
  vector length mode    content       
1 $cw    18     numeric column weights
2 $lw    20     numeric row weights   
3 $eig   13     numeric eigen values  

  data.frame nrow ncol content             
1 $tab       20   18   modified array      
2 $li        20   2    row coordinates     
3 $l1        20   2    row normed scores   
4 $co        18   2    column coordinates  
5 $c1        18   2    column normed scores
other elements: cent norm 

Visualising PCA results

  • The adegenet package has a number of useful plotting functions
s.label(pca_res$li)

Visualising PCA results

  • Alternatively, we can look at the plot in the context of our population samples
s.class(pca_res$li, fac = pops@pop, col = c("red", "blue"))

Summarising eigenvalue variance

  • Eigenvalues represent the absolute variance explained
  • It is common to report the proportion of variation explained per PC
eig_percent <- round(pca_res$eig / sum(pca_res$eig) * 100, 2)
# what are the percent variance explained by PC1 and PC2?
eig_percent[1:2]
[1] 37.94 14.84
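  • These percentages are often shown as a simple bar plot, for example:

    barplot(eig_percent, names.arg = seq_along(eig_percent),
            xlab = "PC", ylab = "% variance explained")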

DIY: practical 3

Clustering approaches

Why clustering?

  • PCA is useful for visually exploring how genetic variation is distributed among individuals
  • Further tools are required to actually use this information to classify individuals into discrete population groups
    • K-means clustering
    • Hierarchical clustering
  • We will also look at the performance of Kosman & Leonard's (2005) similarity coefficient for clustering individuals (experimental!).

K-means

  • K-means requires the specification of the number of clusters (i.e. centres)
  • Using this information, the K-means algorithm will randomly initiate cluster centres and then group individuals by minimising within group variation
  • The process is repeated until convergence is reached (i.e. groups remain stable)
    • adegenet automates the process of choosing the number of clusters using Bayesian Information Criterion (BIC) for model selection

      find.clusters()
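  • Conceptually this resembles running base R's kmeans on the retained PCs for a range of k and comparing the fits; a rough sketch using the PCs from the two-population PCA above (find.clusters does the equivalent work and compares models with BIC):

    # k-means on the retained PCs for k = 1..10 (settings are illustrative)
    pcs <- pca_res$li
    wss <- sapply(1:10, function(k){
      kmeans(pcs, centers = k, nstart = 10)$tot.withinss
    })
    plot(1:10, wss, type = "b", xlab = "number of clusters (k)",
         ylab = "within-group sum of squares")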

K-means: in practice

  • Using a simple four-population data set, the K-means process is as follows:
    1. Read our data in using read.genepop

      four_pops <- read.genepop(file = "4pop.gen")
    2. Scale our data and deal with missing genotypes

      four_pops <- scaleGen(x = four_pops, missing = "mean")

K-means: in practice

  • Now we use the find.clusters function to identify the number of clusters that best explains our data.

    grps <- find.clusters(x = four_pops, max.n.clust = 10)
  • We are asked to choose how many PC axes to retain for K-means clustering

K-means: in practice

  • In this case we can keep all PCs since there is no clear asymptote.
  • Next we are shown a plot of BIC against the number of clusters tested and asked to choose the number of clusters to use to group individuals.
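  • If we already know how many PCs and clusters we want (e.g. when scripting), the interactive prompts can be skipped via the n.pca and n.clust arguments; the values below are purely illustrative:

    grps <- find.clusters(x = four_pops, max.n.clust = 10,
                          n.pca = 40, n.clust = 4)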

K-means: in practice

  • Now that we have grouped individuals into discrete clusters we can view how the K-means clusters relate to our original samples.
table.value(table(pop(four_pops), grps$grp), col.lab=paste("grp", 1:4))

K-means: explained

  • Here is how the samples are positioned on PC1 and PC2

K-means: explained

  • Colours represent the original sample groupings

K-means: explained

  • Colours now represent the inferred k-means clusters

Using our clusters for DAPC

  • Discriminant Analysis of Principal Components
    • A method to describe the relationship between inferred clusters
  • DAPC partitions the total variation into between-group and within-group components, maximising the former while minimising the latter \[ VAR(X) = B(X) + W(X) \]
  • In practice, this process is achieved through the construction of a few synthetic variables (discriminant functions), representing linear combinations of the allele frequencies
    • lowest within group variation
    • highest between group variation
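  • A tiny numerical illustration of this decomposition, showing that the total sum of squares splits into between-group and within-group components:

    x   <- c(1, 2, 3, 11, 12, 13)                  # one toy variable
    grp <- factor(c("a", "a", "a", "b", "b", "b")) # two toy groups
    tot <- sum((x - mean(x))^2)                                                 # total SS
    wth <- sum(tapply(x, grp, function(g) sum((g - mean(g))^2)))                # within, W(X)
    btw <- sum(tapply(x, grp, function(g) length(g) * (mean(g) - mean(x))^2))   # between, B(X)
    c(total = tot, between = btw, within = wth, between_plus_within = btw + wth)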

DAPC in practice

  • The actual dapc command is as follows

    dapc_4pop <- dapc(x = four_pops, grp = grps$grp)
  • Again we will be asked to choose the number of PC axes to retain, and then the number of discriminant functions (usually \(k\)-1 for < 10 clusters)
    • Choosing too many PCs can result in over-fitting

DAPC in practice

  • The resulting object, dapc_4pop, contains lots of information
    • Posterior membership probabilities (similar to STRUCTURE admixture proportions)
  • Groupings can be visualised with the scatter function
scatter(dapc_4pop, scree.pca = TRUE)

DAPC in practice

  • Using the posterior membership probabilities, we can also generate plots similar to those generated for STRUCTURE results
library(reshape2)
library(ggplot2)
dat <- melt(dapc_4pop$posterior)
ggplot(dat, aes(x = Var1, y = value, fill = as.factor(Var2))) +
  geom_bar(stat = "identity", width = 1)

DAPC in practice

  • Some of the challenges associated with DAPC analysis include
    • Accurate inference of \(k\) (is k-means good enough?)
    • Choosing the number of PCs to retain (too many = over-fitting)
  • The first of these is relatively straightforward to do.
    • Use STRUCTURE/BAPS/sPCA/GENELAND….
  • The second can be assessed in an ad hoc way using adegenet

Choosing the optimal number of PCs to retain

  • Choosing the number of PCs to retain is a trade-off between too little discriminatory power and over-fitting
  • The \(a\)-score measures the optimal point in this trade-off
test_dapc <- dapc(x = four_pops, grp = grps$grp, n.pca = 60, n.da = 3)
a_score <- optim.a.score(test_dapc)
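  • The returned object summarises the \(a\)-scores for different numbers of retained PCs; the suggested optimum should be stored in a_score$best (check str(a_score) if your adegenet version stores it differently)

    a_score$best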

Choosing the optimal number of PCs to retain

  • Using this information we can then define our final DAPC k-model
final_dapc <- dapc(x = four_pops, grp = grps$grp, n.pca = 6, n.da = 3)
scatter(final_dapc, scree.pca = TRUE)

K-model Performance

  • An additional strategy for testing the performance of our groupings is to use a training data set to infer and characterise clusters, and then to assess how accurately supplementary individuals can be assigned to those clusters.
    • Supplementary individuals do not help construct our k-model
    • They are usually a subset of the original data
    • This process can be automated in R

K-model Performance

  • The training and the supplementary data sets can be created as follows:
# define the supplementary individuals
sup_ind  <- unlist(tapply(1:nInd(four_pops), pop(four_pops), 
                          function(x){
                            return(sample(x, 20))
                          })
)
# create the supplementary and the training data sets
sup_data <- four_pops[sup_ind]
trn_data <- four_pops[-sup_ind]
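  • Before building the DAPC model on the next slide we also need cluster assignments for the training individuals (trn_grps); a minimal sketch, assuming four_pops here is the unscaled genind object and using illustrative non-interactive settings:

trn_scaled <- scaleGen(x = trn_data, missing = "mean")
trn_grps   <- find.clusters(x = trn_scaled, n.pca = 40, n.clust = 4)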

K-model Performance

  • Now we can build our DAPC model with the training data set
test_model <- dapc(x = trn_data, pop = trn_grps$grp, n.pca = 6, n.da = 3)
scatter(test_model)

K-model Performance

  • Using the function predict.dapc we can predict the results for our supplementary individuals
pred_mod <- predict.dapc(test_model, newdata = sup_data)
  • pred_mod contains all of the necessary information about the supplementary individuals in relation to the model defined by the training data.
  • This information can be assessed in a number of ways to understand the performance of our model.
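  • For example, the predicted group and the posterior membership probabilities of each supplementary individual are available as:

head(pred_mod$assign)               # predicted cluster per supplementary individual
round(head(pred_mod$posterior), 3)  # posterior membership probabilities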

K-model Performance

  • Now we can visualise the model performance
trn_points <- data.frame(x = test_model$ind.coord[,1],
                         y = test_model$ind.coord[,2],
                         t_points = test_model$grp)
sup_points <- data.frame(x = pred_mod$ind.scores[,1], 
                         y = pred_mod$ind.scores[,2],
                         s_points = pred_mod$assign)
ggplot(trn_points, aes(x = x, y = y)) + 
  geom_point(aes(colour = t_points), size = 10, alpha = 0.2) +
  geom_point(data = sup_points, aes(x = x, y = y, colour = s_points), 
             pch = 15, size = 6) +
  theme_bw()

K-model Performance

K-model Performance

  • Additionally, we can look at the proportions of supplementary individuals correctly assigned to their population of origin
table.value(table(pred_mod$assign, pop(sup_data)),
            col.lab=levels(pop(sup_data)))
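  • The same table can be expressed as proportions per population of origin, for example:

assign_tab <- table(pred_mod$assign, pop(sup_data))
# proportion of each source population assigned to each inferred cluster
round(prop.table(assign_tab, margin = 2), 2)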

Fun time again: Practical 4

Spatially explicit structure inference

Spatially explicit structure inference

  • Although this does not appear to be the case for our brown trout populations, sometimes genetic data alone does not provide the discriminatory power to group individuals unambiguously.
  • Genetic processes often result in spatial structuring
    • Island model (+ve)
    • Hierarchical island model (+ve)
    • Inbreeding avoidance (-ve)
    • Isolation by distance (+ve)

Spatial PCA

  • Implemented in adegenet
  • Uses spatial information (e.g. xy/lat long coordinates/spatial weights) to identify correlated genetic patterns
  • Is genetic variation spatially autocorrelated?
    • A positive relationship suggests that individuals are genetically more similar over small spatial scales than expected at random
    • A negative relationship suggests the opposite

Spatial PCA in practice

  • Summarise genetic variation into a few variables (good ol' PCA)
  • Construct "composite variables" that are a compromise between variability in both genetic variation and spatial structure
  • Choose which composite variables to interpret
  • Visualise the structure
  • Use sPCA axes to cluster individuals (experimental!)

sPCA: toy example

   #####################
   ### Genind object ### 
   #####################
- genotypes of individuals - 

S4 class:  genind
@call: NULL

@tab:  335 x 55 matrix of genotypes

@ind.names: vector of  335 individual names
@loc.names: vector of  9 locus names
@loc.nall: number of alleles per locus
@loc.fac: locus factor for the  55 columns of @tab
@all.names: list of  9 components yielding allele names for each locus
@ploidy:  2
@type:  codom

Optional contents: 
@pop:  - empty -
@pop.names:  - empty -

@other: a list containing: xy  mnt  showBauges 

sPCA: toy example

sPCA: toy example

  • Let's run a PCA on these data

    scaled_rup <- scaleGen(rupica)
    rup_pca <- dudi.pca(scaled_rup, scale = FALSE, center = FALSE)

sPCA: toy example

  • It is clear from the bar plot that no single PC explains a large proportion of the total genetic variation
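  • A bar plot of the eigenvalues, like the one referred to above, can be produced with, for example:

    barplot(rup_pca$eig, xlab = "PC", ylab = "eigenvalue")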

sPCA: toy example

  • Nothing too clear emerges when looking at the \(1^{st}\) and \(2^{nd}\) PCs.
  • Most of the variation explained by the first axis corresponds to outliers
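  • The PC1/PC2 scatterplot can be reproduced with, for example:

    s.label(rup_pca$li)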

sPCA: toy example

sPCA: toy example

  • Using standard PCA there does not appear to be any structuring among the chamois individuals sampled.
  • Now, using additional information (i.e. the geographical relationships among individuals), can we refine the distribution of genetic variation?
  • The first step is to define, in some way, the geographical relationships.
    • Chamois home range is roughly a circle with a radius of 1150m

      library("spdep")
      rup_network <- chooseCN(rupica@other$xy, type = 5, d1 = 0, d2 = 2300, 
                      res = "listw")

sPCA: toy example

  • The network looks like this
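  • One way to draw the connection network over the sampling coordinates (rup_network is a listw object, so we plot its neighbours component using spdep's plot method for nb objects):

    plot(rup_network$neighbours, rupica@other$xy)
    points(rupica@other$xy, pch = 20, col = "red")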

sPCA: toy example

  • We can run a Moran's I test to see if there is significant spatial autocorrelation within the genetic data
pc1_moran <- moran.mc(x = rup_pca$li[,1], listw = rup_network, nsim = 999)
plot(pc1_moran)
text(x = 0.015, y = 50, paste("p.value = ", pc1_moran$p.value, sep = ""))

sPCA: toy example

  • What about PC2?
pc2_moran <- moran.mc(x = rup_pca$li[,2], listw = rup_network, nsim = 999)
plot(pc2_moran)
text(x = 0.015, y = 20, paste("p.value = ", pc2_moran$p.value, sep = ""))

sPCA: toy example

  • From these tests we can see that PC1 has significant positive spatial autocorrelation, suggesting that spatially close individuals have more similar genotypes.
  • The next challenge is to assess whether there is multivariate spatial autocorrelation.
    • Calculate the Euclidean distance between individuals' scaled genotypes (dist)
    • Calculate the Euclidean distance between individuals' xy coordinates (dist)
    • Run a Mantel test for a correlation between these two distance matrices (mantel.randtest)

sPCA: toy example

  • Mantel test
man_test <- mantel.randtest(m1 = dist(scaled_rup), m2 = dist(rupica@other$xy))
plot(man_test)
text(x = (-0.07), y = 200, paste("p.value = ", man_test$pvalue, sep = ""))

sPCA: toy example

  • Good evidence to support the use of sPCA to extract the spatial patterns within the data

    rup_spca <- spca(obj = rupica, cn = rup_network)
  • We will be shown a bar plot and asked how many first axes, and then how many second axes, we wish to retain:
    • first axes refer to positive, global structures
    • second axes refer to negative, local structure
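  • As with find.clusters, the interactive prompt can be skipped by setting scannf = FALSE and choosing the numbers of positive (global) and negative (local) axes directly, for example:

    rup_spca <- spca(obj = rupica, cn = rup_network, scannf = FALSE,
                     nfposi = 1, nfnega = 1)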

sPCA: toy example

  • SPC1 certainly appears to explain a large portion of the spatial genetic structure. SPC2 is questionable.

sPCA: toy example

  • A screeplot of spatial autocorrelation against genetic variance
    • SPC1 stands out, while all other SPCs do not.
    • Suggests all of the spatially meaningful genetic variation is in SPC1
screeplot(rup_spca)

sPCA: toy example

  • Now we can go on to visualise these results on the map (SPC1)

sPCA: toy example

  • Now we can go on to visualise these results on the map (SPC2)

sPCA: toy example

  • Visually, both axes show some interesting spatial patterns.
  • The function colorplot can be used to visualise both axes together
rupica@other$showBauges()
colorplot(rupica@other$xy, rup_spca$ls, axes=1:2, transp=TRUE, 
          add=TRUE, cex = 3)

What next?

  • Obviously, these plotting techniques allow us to visualise genetic structure, but what about classifying individuals into discrete groups?
  • Let's experiment with hierarchical clustering
spca_tree <- hclust(dist(rup_spca$ls[,1:2]), method = "ward.D2")
plot(spca_tree)

What next?

What next?

  • Choose a cut point in the tree
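  • One way to mark a candidate cut height on the dendrogram (0.3 matches the cutree call on the next slide):

    plot(spca_tree)
    abline(h = 0.3, col = "red", lty = 2)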

What next?

  • Using the following command, we get the group assignment of each individual based on clustering of sPCA axes 1 and 2.

    spca_grps <- cutree(spca_tree, h = 0.3)
  8  63  64  66  67  68  69  70  71  72  74  76  77  78  81  82  83  85  86 162 
  1   2   1   1   1   1   1   1   1   1   1   1   2   2   1   1   1   1   1   1 
  • How do these categorical classifications correspond to the colorplot results?
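  • One way to compare them is to plot the cluster assignments at the sampling coordinates (using s.class from ade4 over the same map):

    rupica@other$showBauges()
    s.class(rupica@other$xy, fac = as.factor(spca_grps), add.plot = TRUE,
            col = rainbow(length(unique(spca_grps))))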

What next?

What next?

  • What about higher resolution differences?

What next?

Fun time again