6 May 2015

Outline

  • In this session we will learn a number of approaches for inferring population structure among population samples
  • Genetics-only approaches
    • PCA
    • K-means
    • Hierarchical clustering
  • Spatially explicit approaches
    • sPCA
    • MEMGENE (new!)
  • Light on theory, heavy on practice!
    • For a more detailed theoretical exploration of these analyses, see the adegenet website

Multivariate analysis

  • Also known as:
    • Dimension reduction
    • Ordination methods
    • Factorial methods
  • Designed to reduce complexity of multiple variables to:
    • summarise diversity among individuals
    • summarise co-variation between variables
  • Principal components are synthetic variables that summarise most of the variation in the data (see the toy sketch below)
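  • As a toy illustration (using base R's prcomp and made-up variables, rather than the adegenet workflow we use below), a few correlated variables can be summarised by one or two principal components:

    # toy example: 100 individuals scored for 3 variables, two of them correlated
    set.seed(42)
    base_var <- rnorm(100)
    toy <- data.frame(v1 = base_var + rnorm(100, sd = 0.3),
                      v2 = base_var + rnorm(100, sd = 0.3),
                      v3 = rnorm(100))
    toy_pca <- prcomp(toy, scale. = TRUE)
    # proportion of the total variance captured by each PC
    summary(toy_pca)$importance["Proportion of Variance", ]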

Analysis workflow

  1. Do a PCA on genetic data to reduce data complexity
  2. Select principal components which explain most variation
  3. Use selected PCs to cluster individuals
    • K-means
    • Hierarchical methods
  4. Group individuals into discrete groups

The adegenet package

  • For many of these multivariate analyses we will be using the excellent adegenet package.

    # install the package
    install.packages("adegenet")
    # get a help tutorial
    library(adegenet)
    adegenetTutorial("basics")

Example

Two population model

  • Simple evolutionary history

Read a genepop file with adegenet

  • The adegenet package can be used to read genepop data into R; the first lines of our genepop file look like this:
../resources/fsc_linux64/2pop_test/2pop_test_1_1_gen_converted
locus1
locus2
locus3
locus4
locus5
POP
pop1_01 ,   500500  500499  501501  500500  500501
pop1_02 ,   501500  500498  501501  500500  500500
pop1_03 ,   500500  499498  501501  501500  500500
pop1_04 ,   500501  498499  501500  500500  500500
pop1_05 ,   500499  498500  501500  500500  500500
pop1_06 ,   501500  498500  501501  499500  500500
pop1_07 ,   499500  499499  501501  500500  502499
pop1_08 ,   500500  500500  501500  500500  501502
pop1_09 ,   499500  500499  501501  500500  500500
pop1_10 ,   500501  499500  501501  500500  500500
POP
pop2_01 ,   500500  499500  499499  501500  499499
pop2_02 ,   500500  499500  500500  498500  499499
...

Read a genepop file with adegenet

pops <- read.genepop(file = "2pops.gen")
pops

   #####################
   ### Genind object ### 
   #####################
- genotypes of individuals - 

S4 class:  genind
@call: read.genepop(file = "2pops.gen")

@tab:  20 x 18 matrix of genotypes

@ind.names: vector of  20 individual names
@loc.names: vector of  5 locus names
@loc.nall: number of alleles per locus
@loc.fac: locus factor for the  18 columns of @tab
@all.names: list of  5 components yielding allele names for each locus
@ploidy:  2
@type:  codom

Optional contents: 
@pop:  factor giving the population of each individual
@pop.names:  factor giving the name of each population

@other: - empty -

Data structure

  • The pops variable is a genind S4 object
    • Uses slots to store separate pieces of information
    pops@loc.names
          L1       L2       L3       L4       L5 
    "locus1" "locus2" "locus3" "locus4" "locus5" 
  • Actual genotypes are stored in the tab slot
pops@tab[1:5, 1:5]
   L1.1 L1.2 L1.3 L2.1 L2.2
01  0.0  1.0  0.0  0.0  0.5
02  0.0  0.5  0.5  0.5  0.0
03  0.0  1.0  0.0  0.5  0.5
04  0.0  0.5  0.5  0.5  0.5
05  0.5  0.5  0.0  0.5  0.0

Making sure our data are ready

  • How you deal with missing data when using multivariate methods is important
    • The most common approach is to replace missing values with the mean allele frequency
    • This does not inflate the variation in the PCA
  • Allele frequencies in @tab should also be scaled (i.e. normalised)
    • Both of these tasks can be done using the function scaleGen

      scaled_pops <- scaleGen(pops, missing = "mean")
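  • A minimal sketch of what this does under the hood (mean imputation, then centring each allele column); scaleGen handles the details, including optional scaling, for us:

    freqs <- pops@tab                       # individual x allele frequency matrix
    # replace missing values with that allele's mean frequency
    freqs <- apply(freqs, 2, function(a){
      a[is.na(a)] <- mean(a, na.rm = TRUE)
      a
    })
    # centre each allele column (scaleGen can also scale them)
    scaled_by_hand <- scale(freqs, center = TRUE, scale = FALSE)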

Running a PCA

  • Now we can run our PCA using the function dudi.pca.

    pca_res <- dudi.pca(df = scaled_pops, scale = FALSE, center = FALSE,
                        scannf = FALSE, nf = 2)

Visualising PCA results

  • The PCA results look like this
pca_res
Duality diagramm
class: pca dudi
$call: dudi.pca(df = scaled_pops, center = FALSE, scale = FALSE, scannf = FALSE, 
    nf = 2)

$nf: 2 axis-components saved
$rank: 13
eigen values: 3.816 1.493 1.092 0.9362 0.6944 ...
  vector length mode    content       
1 $cw    18     numeric column weights
2 $lw    20     numeric row weights   
3 $eig   13     numeric eigen values  

  data.frame nrow ncol content             
1 $tab       20   18   modified array      
2 $li        20   2    row coordinates     
3 $l1        20   2    row normed scores   
4 $co        18   2    column coordinates  
5 $c1        18   2    column normed scores
other elements: cent norm 

Visualising PCA results

  • The adegenet package has a number of useful plotting functions
s.label(pca_res$li)

Visualising PCA results

  • Alternatively, we can look at the plot in the context of our population samples
s.class(pca_res$li, fac = pops@pop, col = c("red", "blue"))

Summarising eigenvalue variance

  • Eigenvalues represent the absolute variance explained
  • It is common to report the proportion of variation explained per PC
eig_percent <- round(pca_res$eig / sum(pca_res$eig) * 100, 2)
# what are the percent variance explained by PC1 and PC2?
eig_percent[1:2]
[1] 37.94 14.84
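  • These percentages are often shown as a simple bar plot, for example:

    barplot(eig_percent, names.arg = seq_along(eig_percent),
            xlab = "PC", ylab = "% variance explained")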

DIY: practical 3

Clustering approaches

Why clustering?

  • PCA is useful for visually exploring how genetic variation is distributed among individuals
  • Further tools are required to actually use this information to classify individuals into discrete population groups
    • K-means clustering
    • Hierarchical clustering
  • We will also look at the performance of Kosman & Leonard's (2005) similarity coefficient for clustering individuals (experimental!).

K-means

  • K-means requires the specification of the number of clusters (i.e. centres)
  • Using this information, the K-means algorithm will randomly initiate cluster centres and then group individuals by minimising within group variation
  • The process is repeated until convergence is reached (i.e. groups remain stable)
    • adegenet automates the process of choosing the number of clusters using Bayesian Information Criterion (BIC) for model selection

      find.clusters()
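  • Conceptually this resembles running base R's kmeans on the retained PCs for a range of k and comparing the fits; a rough sketch using the PCs from the two-population PCA above (find.clusters does the equivalent work and compares models with BIC):

    # k-means on the retained PCs for k = 1..10 (settings are illustrative)
    pcs <- pca_res$li
    wss <- sapply(1:10, function(k){
      kmeans(pcs, centers = k, nstart = 10)$tot.withinss
    })
    plot(1:10, wss, type = "b", xlab = "number of clusters (k)",
         ylab = "within-group sum of squares")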

K-means: in practice

  • Using a simple four-population data set, the K-means process is as follows:
    1. Read our data in using read.genepop

      four_pops <- read.genepop(file = "4pop.gen")
    2. Scale our data and deal with missing genotypes

      four_pops <- scaleGen(x = four_pops, missing = "mean")

K-means: in practice

  • Now we use the find.clusters function to identify the number of clusters that best explains our data.

    grps <- find.clusters(x = four_pops, max.n.clust = 10)
  • We are asked to choose how many PC axes to retain for K-means clustering

K-means: in practice

  • In this case we can keep all PCs since there is no clear asymptote.
  • Next we are shown a plot of BIC against the number of clusters tested and asked to choose the number of clusters to use to group individuals.
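  • If we already know how many PCs and clusters we want (e.g. when scripting), the interactive prompts can be skipped via the n.pca and n.clust arguments; the values below are purely illustrative:

    grps <- find.clusters(x = four_pops, max.n.clust = 10,
                          n.pca = 40, n.clust = 4)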

K-means: in practice

  • Now that we have grouped individuals into discrete clusters we can view how the K-means clusters relate to our original samples.
table.value(table(pop(four_pops), grps$grp), col.lab=paste("grp", 1:4))

K-means: explained

  • Here is how the samples are positioned on PC1 and PC2

K-means: explained

  • Colours represent the original sample groupings

K-means: explained

  • Colours now represent the inferred k-means clusters

Using our clusters for DAPC

  • Discriminant Analysis of Principal Components
    • A method to describe the relationship between inferred clusters
  • DAPC partitions the total variation into between-group and within-group components, maximising the former while minimising the latter \[ VAR(X) = B(X) + W(X) \]
  • In practice, this process is achieved through the construction of a few synthetic variables (discriminant functions), representing linear combinations of the allele frequencies
    • lowest within group variation
    • highest between group variation
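  • A tiny numerical illustration of this decomposition, showing that the total sum of squares splits into between-group and within-group components:

    x   <- c(1, 2, 3, 11, 12, 13)                  # one toy variable
    grp <- factor(c("a", "a", "a", "b", "b", "b")) # two toy groups
    tot <- sum((x - mean(x))^2)                                                 # total SS
    wth <- sum(tapply(x, grp, function(g) sum((g - mean(g))^2)))                # within, W(X)
    btw <- sum(tapply(x, grp, function(g) length(g) * (mean(g) - mean(x))^2))   # between, B(X)
    c(total = tot, between = btw, within = wth, between_plus_within = btw + wth)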

DAPC in practice

  • The actual dapc command is as follows

    dapc_4pop <- dapc(x = four_pops, grp = grps$grp)
  • Again we will be asked to choose the number of PC axes to retain, and then the number of discriminant functions (usually \(k\)-1 for < 10 clusters)
    • Choosing too many PCs can result in over-fitting

DAPC in practice

  • The resulting object, dapc_4pop, contains lots of information
    • Posterior membership probabilities (similar to STRUCTURE admixture proportions)
  • Groupings can be visualised with the scatter function
scatter(dapc_4pop, scree.pca = TRUE)

DAPC in practice

  • Using the posterior membership probabilities, we can also generate plots similar to those generated for STRUCTURE results
library(reshape2)
library(ggplot2)
dat <- melt(dapc_4pop$posterior)
ggplot(dat, aes(x = Var1, y = value, fill = as.factor(Var2))) +
  geom_bar(stat = "identity", width = 1)

DAPC in practice

  • Some of the challenges associated with DAPC analysis include
    • Accurate inference of \(k\) (is k-means good enough?)
    • Choosing the number of PCs to retain (too many = over-fitting)
  • The first of these is relatively straightforward to do.
    • Use STRUCTURE/BAPS/sPCA/GENELAND….
  • The second can be assessed in an ad hoc way using adegenet

Choosing the optimal number of PCs to retain

  • Choosing the number of PCs to retain is a trade-off between too little discriminatory power and over-fitting
  • The \(a\)-score measures the optimal point in this trade-off
test_dapc <- dapc(x = four_pops, grp = grps$grp, n.pca = 60, n.da = 3)
a_score <- optim.a.score(test_dapc)
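  • The returned object summarises the \(a\)-scores for different numbers of retained PCs; the suggested optimum should be stored in a_score$best (check str(a_score) if your adegenet version stores it differently)

    a_score$best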

Choosing the optimal number of PCs to retain

  • Using this information we can then define our final DAPC k-model
final_dapc <- dapc(x = four_pops, grp = grps$grp, n.pca = 6, n.da = 3)
scatter(final_dapc, scree.pca = TRUE)

K-model Performance

  • An additional strategy for testing the performance of our groupings is to use a training data set to infer and characterise clusters, and then to assess how accurately supplementary individuals can be assigned to those clusters.
    • Supplementary individuals do not help construct our k-model
    • They are usually a subset of the original data
    • This process can be automated in R

K-model Performance

  • The training and the supplementary data sets can be created as follows:
# define the supplementary individuals
sup_ind  <- unlist(tapply(1:nInd(four_pops), pop(four_pops), 
                          function(x){
                            return(sample(x, 20))
                          })
)
# create the supplementary and the training data sets
sup_data <- four_pops[sup_ind]
trn_data <- four_pops[-sup_ind]
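  • Before building the DAPC model on the next slide we also need cluster assignments for the training individuals (trn_grps); a minimal sketch, assuming four_pops here is the unscaled genind object and using illustrative non-interactive settings:

trn_scaled <- scaleGen(x = trn_data, missing = "mean")
trn_grps   <- find.clusters(x = trn_scaled, n.pca = 40, n.clust = 4)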

K-model Performance

  • Now we can build our DAPC model with the training data set
test_model <- dapc(x = trn_data, pop = trn_grps$grp, n.pca = 6, n.da = 3)
scatter(test_model)

K-model Performance

  • Using the function predict.dapc we can predict the results for our supplementary individuals
pred_mod <- predict.dapc(test_model, newdata = sup_data)
  • pred_mod contains all of the necessary information about the supplementary individuals in relation to the model defined by the training data.
  • This information can be assessed in a number of ways to understand the performance of our model.
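  • For example, the predicted group and the posterior membership probabilities of each supplementary individual are available as:

head(pred_mod$assign)               # predicted cluster per supplementary individual
round(head(pred_mod$posterior), 3)  # posterior membership probabilities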

K-model Performance

  • Now we can visualise the model performance
trn_points <- data.frame(x = test_model$ind.coord[,1],
                         y = test_model$ind.coord[,2],
                         t_points = test_model$grp)
sup_points <- data.frame(x = pred_mod$ind.scores[,1], 
                         y = pred_mod$ind.scores[,2],
                         s_points = pred_mod$assign)
ggplot(trn_points, aes(x = x, y = y)) + 
  geom_point(aes(colour = t_points), size = 10, alpha = 0.2) +
  geom_point(data = sup_points, aes(x = x, y = y, colour = s_points), 
             pch = 15, size = 6) +
  theme_bw()

K-model Performance

K-model Performance

  • Additionally, we can look at the proportions of supplementary individuals correctly assigned to their population of origin
table.value(table(pred_mod$assign, pop(sup_data)),
            col.lab=levels(pop(sup_data)))
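  • The same table can be expressed as proportions per population of origin, for example:

assign_tab <- table(pred_mod$assign, pop(sup_data))
# proportion of each source population assigned to each inferred cluster
round(prop.table(assign_tab, margin = 2), 2)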

Fun time again: Practical 4

Spatially explicit structure inference

Spatially explicit structure inference

  • Although this does not appear to be the case for our brown trout populations, sometimes genetic data alone does not provide the discriminatory power to group individuals unambiguously.
  • Genetic processes often result in spatial structuring
    • Island model (+ve)
    • Hierarchical island model (+ve)
    • Inbreeding avoidance (-ve)
    • Isolation by distance (+ve)

Spatial PCA

  • Implemented in adegenet
  • Uses spatial information (e.g. xy/lat long coordinates/spatial weights) to identify correlated genetic patterns
  • Is genetic variation spatially autocorrelated?
    • A positive relationship suggests that individuals are genetically more similar over small spatial scales than expected at random
    • A negative relationship suggests the opposite

Spatial PCA in practice

  • Summarise genetic variation into a few variables (good ol' PCA)
  • Construct "composite variables" that are a compromise between variability in both genetic variation and spatial structure
  • Choose which composite variables to interpret
  • Visualise the structure
  • Use sPCA axes to cluster individuals (experimental!)

sPCA: toy example

   #####################
   ### Genind object ### 
   #####################
- genotypes of individuals - 

S4 class:  genind
@call: NULL

@tab:  335 x 55 matrix of genotypes

@ind.names: vector of  335 individual names
@loc.names: vector of  9 locus names
@loc.nall: number of alleles per locus
@loc.fac: locus factor for the  55 columns of @tab
@all.names: list of  9 components yielding allele names for each locus
@ploidy:  2
@type:  codom

Optional contents: 
@pop:  - empty -
@pop.names:  - empty -

@other: a list containing: xy  mnt  showBauges 

sPCA: toy example

sPCA: toy example

  • Let's run a PCA on these data

    scaled_rup <- scaleGen(rupica)
    rup_pca <- dudi.pca(scaled_rup, scale = FALSE, center = FALSE)

sPCA: toy example

  • It is clear from the bar plot that no single PC explains a large proportion of the total genetic variation
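  • A bar plot of the eigenvalues, like the one referred to above, can be produced with, for example:

    barplot(rup_pca$eig, xlab = "PC", ylab = "eigenvalue")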

sPCA: toy example

  • Nothing too clear emerges when looking at the \(1^{st}\) and \(2^{nd}\) PCs.
  • Most of the variation explained by the first axis corresponds to outliers
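  • The PC1/PC2 scatterplot can be reproduced with, for example:

    s.label(rup_pca$li)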

sPCA: toy example

sPCA: toy example

  • Using standard PCA there does not appear to be any structuring among the chamois individuals sampled.
  • Now, using additional information (i.e. the geographical relationships among individuals), can we refine the distribution of genetic variation?
  • The first step is to define, in some way, the geographical relationships.
    • Chamois home range is roughly a circle with a radius of 1150m

      library("spdep")
      rup_network <- chooseCN(rupica@other$xy, type = 5, d1 = 0, d2 = 2300, 
                      res = "listw")

sPCA: toy example

  • The network looks like this
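  • One way to draw the connection network over the sampling coordinates (rup_network is a listw object, so we plot its neighbours component using spdep's plot method for nb objects):

    plot(rup_network$neighbours, rupica@other$xy)
    points(rupica@other$xy, pch = 20, col = "red")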

sPCA: toy example

  • We can run a Moran's I test to see if there is significant spatial autocorrelation within the genetic data
pc1_moran <- moran.mc(x = rup_pca$li[,1], listw = rup_network, nsim = 999)
plot(pc1_moran)
text(x = 0.015, y = 50, paste("p.value = ", pc1_moran$p.value, sep = ""))

sPCA: toy example

  • What about PC2?
pc2_moran <- moran.mc(x = rup_pca$li[,2], listw = rup_network, nsim = 999)
plot(pc2_moran)
text(x = 0.015, y = 20, paste("p.value = ", pc2_moran$p.value, sep = ""))

sPCA: toy example

  • From these tests we can see that PC1 has significant positive spatial autocorrelation, suggesting that spatially close individuals have more similar genotypes.
  • The next challenge is to assess whether there is multivariate spatial autocorrelation.
    • Calculate the Euclidean distance between individuals' scaled genotypes (dist)
    • Calculate the Euclidean distance between individuals' xy coordinates (dist)
    • Run a Mantel test for a correlation between these two distance matrices (mantel.randtest)

sPCA: toy example

  • Mantel test
man_test <- mantel.randtest(m1 = dist(scaled_rup), m2 = dist(rupica@other$xy))
plot(man_test)
text(x = (-0.07), y = 200, paste("p.value = ", man_test$pvalue, sep = ""))

sPCA: toy example

  • Good evidence to support the use of sPCA to extract the spatial patterns within the data

    rup_spca <- spca(obj = rupica, cn = rup_network)
  • We will be shown a bar plot and asked how many first axes, and then how many second axes, we wish to retain:
    • first axes refer to positive, global structures
    • second axes refer to negative, local structure
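  • As with find.clusters, the interactive prompt can be skipped by setting scannf = FALSE and choosing the numbers of positive (global) and negative (local) axes directly, for example:

    rup_spca <- spca(obj = rupica, cn = rup_network, scannf = FALSE,
                     nfposi = 1, nfnega = 1)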

sPCA: toy example

  • SPC1 certainly appears to explain a large portion of the spatial genetic structure. SPC2 is questionable.

sPCA: toy example

  • A screeplot of spatial autocorrelation against genetic variance
    • SPC1 stands out, while all other SPCs do not.
    • Suggests all of the spatially meaningful genetic variation is in SPC1
screeplot(rup_spca)

sPCA: toy example

  • Now we can go on to visualise these results on the map (SPC1)

sPCA: toy example

  • Now we can go on to visualise these results on the map (SPC2)

sPCA: toy example

  • Visually, both axes show some interesting spatial patterns.
  • The function colorplot can be used to visualise both axes together
rupica@other$showBauges()
colorplot(rupica@other$xy, rup_spca$ls, axes=1:2, transp=TRUE, 
          add=TRUE, cex = 3)

What next?

  • Obviously, these plotting techniques allow us to visualise genetic structure, but what about classifying individuals into discrete groups?
  • Let's experiment with hierarchical clustering
spca_tree <- hclust(dist(rup_spca$ls[,1:2]), method = "ward.D2")
plot(spca_tree)

What next?

What next?

  • Choose a cut point in the tree
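  • One way to mark a candidate cut height on the dendrogram (0.3 matches the cutree call on the next slide):

    plot(spca_tree)
    abline(h = 0.3, col = "red", lty = 2)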

What next?

  • Using the following command, we get the group assignment of each individual based on clustering of sPCA axes 1 and 2.

    spca_grps <- cutree(spca_tree, h = 0.3)
  8  63  64  66  67  68  69  70  71  72  74  76  77  78  81  82  83  85  86 162 
  1   2   1   1   1   1   1   1   1   1   1   1   2   2   1   1   1   1   1   1 
  • How do these categorical classifications correspond to the colorplot results?
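  • One way to compare them is to plot the cluster assignments at the sampling coordinates (using s.class from ade4 over the same map):

    rupica@other$showBauges()
    s.class(rupica@other$xy, fac = as.factor(spca_grps), add.plot = TRUE,
            col = rainbow(length(unique(spca_grps))))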

What next?

What next?

  • What about higher resolution differences?

What next?

Fun time again