A new version of diveRsity is now available for download from CRAN. It can also be downloaded using the method described here. While most of the updates in this version are minor, some are more substantial. One of these is the inclusion of an automated bias correction for confidence intervals (CIs) estimated for pairwise differentiation. Below is a coded demonstration of the problem of bias when estimating confidence intervals for diversity/differentiation statistics from bootstrapped data.
Load diveRsity and calculate pairwise differentiation, including \( 95\% \) confidence intervals (using 1000 bootstrap iterations).
# load diveRsity
library(diveRsity)
# run fastDivPart on "Test_data" (1000 bootstraps)
data(Test_data)
system.time({
pwDiff <- fastDivPart(Test_data, pairwise = TRUE,
bs_pairwise = TRUE, WC_Fst = TRUE,
bootstrap = 1000, parallel = TRUE)
})
user system elapsed
2.810 0.054 456.167
Since Test_data contains genotype data for six population samples, fastDivPart will calculate confidence intervals for 15 separate pairwise estimates of \( G_{ST} \) (Nei & Chesser, 1983), \( G^{'}_{ST} \) (Hedrick, 2005), \( D_{Jost} \) (Jost, 2008) and \( \theta \) (Weir & Cockerham, 1984). We can plot these point estimates along with their CIs as follows (N.B. by setting the argument plot = TRUE in the original call, these results will be plotted automatically).
# View the pairwise results data
names(pwDiff)
[1] "standard" "estimate" "pairwise" "meanPairwise" "bs_pairwise"
We can see that the pairwise bootstrap results are located at bs_pairwise. This object within pwDiff contains separate data tables for each of the four statistics listed above. All data tables have the same structure, so we can just look at one, \( D_{Jost} \).
pwDiff$bs_pairwise$djostEst
                     actual      mean   BC_mean Lower_95%CI Upper_95%CI BC_Lower_95%CI BC_Upper_95%CI
POP1_1 vs. POP2_1 0.6002342 0.5850703 0.6002342   5.169e-01    0.647625      0.5320938      0.6627891
POP1_1 vs. POP3_1 0.0006103 0.0011152 0.0006103   4.645e-06    0.003614     -0.0005003      0.0031092
POP1_1 vs. POP4_1 0.0002870 0.0002585 0.0002870  -7.567e-05    0.001304     -0.0006211      0.0007581
POP2_1 vs. POP3_1 0.5903311 0.5742200 0.5903311   5.086e-01    0.635704      0.5246634      0.6518151
POP2_1 vs. POP4_1 0.5933083 0.5783549 0.5933083   5.139e-01    0.641400      0.5288839      0.6563533
POP3_1 vs. POP4_1 0.0010899 0.0018571 0.0010899   3.038e-04    0.004484     -0.0004634      0.0037167
We can see that the output for each of the four statistics contains seven data columns. We are interested in comparing the standard CIs with the bias corrected CIs.
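The mechanics of such a correction are easy to sketch. A common approach (and only an assumption about what fastDivPart does internally) is to estimate the bootstrap bias as the difference between the mean of the bootstrap replicates and the observed statistic, then shift the percentile interval by that amount:

```r
# Minimal sketch of a bias-corrected percentile CI.
# NOTE: illustrative only; fastDivPart's exact internals may differ.
set.seed(42)
x <- rexp(30)                                  # a small, skewed sample
obs <- var(x)                                  # observed statistic
boot <- replicate(1000, var(sample(x, replace = TRUE)))
bias <- mean(boot) - obs                       # estimated bootstrap bias
uc <- quantile(boot, c(0.025, 0.975))          # uncorrected percentile CI
bc <- uc - bias                                # shifted (bias-corrected) CI
```

With skewed data the bootstrap replicates tend to centre away from the observed value, so the shifted interval is the one that actually brackets the sample estimate.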
# plot using ggplot2
library(ggplot2)
# create a djost dataframe from our pwDiff results to reduce typing
djost <- as.data.frame(pwDiff$bs_pairwise$djostEst)
# The column names cause problems, so we can change them as follows:
colnames(djost) <- c("actual", "mean", "bcMean", "bucLow", "bucUp", "bcLow", "bcUp")
# plot the point estimates for each pairwise comparison
# with uncorrected and corrected 95% CIs
# stack the uncorrected and corrected intervals into one data frame
plotDat <- data.frame(
  comparison = c(seq_len(nrow(djost)), seq_len(nrow(djost)) + 0.3),
  actual = rep(djost$actual, 2),
  lower = c(djost$bucLow, djost$bcLow),
  upper = c(djost$bucUp, djost$bcUp),
  CI_type = factor(rep(c("uc", "bc"), each = nrow(djost)))
)
ggplot(plotDat, aes(x = comparison, y = actual, colour = CI_type)) +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.3) +
  geom_point() +
  ylab(expression(D["Jost"])) +
  xlab("Pairwise Comparison")
A figure demonstrating the difference between bias corrected and uncorrected \( 95\% \) CIs derived from 1000 bootstrap iterations. Blue lines represent the uncorrected CIs, while pink lines represent the bias corrected CIs.
Of particular note is the fact that the uncorrected CIs often don't even encompass the sample estimates. This occurs as a result of the known upward bias associated with bootstrapping these types of statistics. Another important point of note is the increased risk of type I error seen for two of the pairwise comparisons when using uncorrected CIs. In these comparisons, the uncorrected CIs do not overlap \( 0 \), suggesting differentiation between the two populations is statistically significant. However, following bias correction, these pairwise differentiation estimates are no longer significantly different from \( 0 \) (i.e. they overlap \( 0 \)).
From this example, it is clear that correcting for bias in bootstrapped differentiation estimates is important. In line with this, diveRsity now provides automatic bias correction in the fastDivPart function.
N.B. divPart does not have this capability, since it is no longer supported.
Hedrick, P. (2005) A standardized genetic differentiation measure. Evolution, 59, 1633–1638
Jost, L. (2008) GST and its relatives do not measure differentiation. Molecular Ecology, 17, 4015–4026
Keenan, K., McGinnity, P., Cross, T. F., Crozier, W. W., Prodöhl, P. A. (2013) diveRsity: An R package for the estimation and exploration of population genetics parameters and their associated errors. Methods in Ecology and Evolution, 4, 782–788. doi: 10.1111/2041-210X.12067
Nei, M. & Chesser, R. K. (1983) Estimation of fixation indices and gene diversities. Annals of Human Genetics, 47, 253–259
Weir, B. & Cockerham, C. (1984) Estimating F-statistics for the analysis of population structure. Evolution, 38, 1358–1370
The apply family of functions in R is extremely useful. I've been using these functions for quite a while now, generally in place of for loops. However, they are not particularly intuitive for R beginners, in the way that loops can be.
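For anyone coming from loops, though, the mapping is fairly direct. Here's a trivial made-up example of the same computation written both ways:

```r
# compute 1^2 ... 5^2 with a 'for' loop
squares1 <- numeric(5)
for (i in 1:5) squares1[i] <- i^2
# the same thing with sapply: apply the function to each element of 1:5
squares2 <- sapply(1:5, function(i) i^2)
identical(squares1, squares2)  # TRUE
```

The apply version states *what* is computed per element and leaves the iteration bookkeeping to R.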
One apply function that I have never paid much attention to in the past is mapply. I've attempted to use it a few times but could never make sense of the help file and just resorted to loops instead. This morning, however, I was trying to calculate some statistics from the independent elements of two lists that I had generated, and was determined to avoid using a for loop (my default position when writing R code).
A quick Google search suggested that mapply was the way to go. After some fumbling around and lots of trial and error, the scales dropped from my eyes as I hit 'CTRL+ENTER' (in RStudio, of course) and the stop icon disappeared as if it was never there. Previously, when running similar calculations using for loops, the stop icon might have remained tauntingly for up to a minute, maybe more.
It appears that mapply is not only easier to use than I previously thought, but also lightning fast. Let the code below be a testament to its power:
For the purpose of illustration, imagine we have two lists of length 100,000, each element being a matrix of 100 random values with 10 columns and 10 rows. Imagine that we are interested in calculating the element-wise product of each pair of matrices (i.e. \( mat1 \times mat2 \), element by element). Let's have a look at the speed difference between using a for loop and mapply.
# generate the data
list1 <- list()
list2 <- list()
for (i in 1:1e+05) {
  list1[[i]] <- matrix(rnorm(100), ncol = 10)
  list2[[i]] <- matrix(rnorm(100), ncol = 10)
}
# Calculate the matrix products using a 'for' loop
system.time({
listProd1 <- list()
for (i in 1:1e+05) {
  listProd1[[i]] <- list1[[i]] * list2[[i]]
}
})
## user system elapsed
## 32.33 1.34 34.31
# Calculate the matrix products using 'mapply'
system.time({
listProd2 <- mapply(FUN = `*`, list1, list2, SIMPLIFY = FALSE)
})
## user system elapsed
## 0.34 0.03 0.38
# Test to make sure both methods do the same thing
listProd1[[1]] == listProd2[[1]]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [6,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [7,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [8,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [9,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [10,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
listProd1[[1000]] == listProd2[[1000]]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [6,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [7,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [8,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [9,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [10,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
We can see that there is a massive (in computational terms) difference in the performance of these two methods. Although I don't know for sure, I suspect the time penalty in the for loop comes from growing the list element by element, which forces repeated reallocation and is not the best way to do things.
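If growing the list really is the culprit, preallocating it with vector("list", n) should close much of the gap. A quick, hypothetical smaller-scale check of the two styles (time them with system.time() on your own machine):

```r
n <- 1e4
# growing a list element by element
grow <- function() {
  out <- list()
  for (i in 1:n) out[[i]] <- i
  out
}
# preallocating the full list up front, then filling it
prealloc <- function() {
  out <- vector("list", n)
  for (i in 1:n) out[[i]] <- i
  out
}
# both styles produce identical results
identical(grow(), prealloc())  # TRUE
```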
## R Under development (unstable) (2013-09-29 r64014)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.5.1
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.3 evaluate_0.4.7 formatR_0.9 stringr_0.6.2
## [5] tools_3.1.0
if (gp == 3) {
  plMake <- function(x) {
    matrix(sprintf("%06g", as.numeric(x)),
           nrow = nrow(x), ncol = ncol(x))
  }
} else if (gp == 2) {
  plMake <- function(x) {
    matrix(sprintf("%04g", as.numeric(x)),
           nrow = nrow(x), ncol = ncol(x))
  }
}
if (gp == 3) {
  plMake <- function(x) {
    out <- matrix(sprintf("%06g", as.numeric(x)),
                  nrow = nrow(x), ncol = ncol(x))
    if (Sys.info()["sysname"] == "Darwin") {
      out[out == "0000NA"] <- "    NA"
    }
    return(out)
  }
} else if (gp == 2) {
  plMake <- function(x) {
    out <- matrix(sprintf("%04g", as.numeric(x)),
                  nrow = nrow(x), ncol = ncol(x))
    if (Sys.info()["sysname"] == "Darwin") {
      out[out == "00NA"] <- "  NA"
    }
    return(out)
  }
}
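As a quick sanity check of what these formatters produce (with made-up genotype codes; the zero padding is the whole point):

```r
gp <- 3
# three-digit allele version, as above
plMake <- function(x) {
  matrix(sprintf("%06g", as.numeric(x)),
         nrow = nrow(x), ncol = ncol(x))
}
# a 1 x 2 matrix of genotype codes: 1102 is zero-padded to "001102",
# while 101102 already fills the six-character field
plMake(matrix(c(101102, 1102), nrow = 1))
```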
###########################################################
# Running divPart on batches of genepop files in parallel #
###########################################################
# load the diveRsity package
library("diveRsity")
# specify the top level directory under which all 'simulation'
# folders are located, by setting the working directory to it.
setwd("~/simulations")
# The directory tree might look like this:
simulations
|-- simulation1
|   |-- file1.gen
|   |-- file2.gen
|   |-- file3.gen
|-- simulation2
|   |-- file1.gen
|   |-- file2.gen
|   |-- file3.gen
|-- simulation3
|   |-- file1.gen
|   |-- file2.gen
|   |-- file3.gen
# and so on.
# Next we can specify the names of our simulation folders in two ways
# manually
fold_names <- paste("simulation", 1:10, sep = "")
# automatically (when there is only a single level below the
# top directory)
fold_names <- list.dirs(full.names = TRUE, recursive = FALSE)
# Now we can determine the names of all genepop files in each folder
file_names <- lapply(fold_names, function(x){
  files <- dir(path = paste(x, "/", sep = ""), pattern = "\\.gen$",
               full.names = TRUE)
  return(files)
})
# file_names will be a list of length 10. Each element will contain
# the names of all .gen files within the respective simulation folder
# Before we are ready to run the main analyses, we should set up
# the parallel environment
# load the doParallel package
library("doParallel")
# set up a cluster of 10 CPUs (one for each batch of files)
cl <- makeCluster(10)
# Export the 'divPart' function to the cluster cores
clusterExport(cl, "divPart", envir = environment())
# Now we can run the main analyses
results <- parLapply(cl, file_names, function(x){
  sim_res <- sapply(x, function(y){
    out <- divPart(infile = y, gp = 3, WC_Fst = TRUE)
    return(out$estimate[nrow(out$estimate), 4:7])
  })
  return(t(sim_res)) # transpose sim_res
})
# This will generate a list (of length 10), with each element containing
# a matrix of 1000 rows (1 per file) and 4 columns (1 for each diversity
# statistic)
# example of output for simulation 1
G_st_est G_hed_st_est D_Jost_est Fst_WC
0.3905 0.8938 0.8256 0.4010
0.5519 0.8719 0.6986 0.6031
0.5924 0.8880 0.7092 0.6096
... ... ... ...
... ... ... ...
# these results could then be piped to further analyses or visualisation
# tools
# This function will look for all instances of files matching the pattern
# specified in the argument 'patterns', and copy them to a single
# directory given by 'new.location' (the working directory by default).
# Source files are unmodified.
# Written by Kevin Keenan 2013
# Feel free to use, modify and redistribute as you wish
fileSort <- function(patterns = NULL, new.location = getwd()){
  ptn <- patterns
  # Define the file copier/mover function
  cpmvFILE <- function(dirs = NULL, patterns = NULL){
    if(!is.null(dirs)){
      sapply(dirs, function(x){
        names <- dir(path = x, pattern = patterns, ignore.case = TRUE)
        if(length(names) != 0){
          sapply(names, function(y){
            file.copy(from = paste(x, "/", y, sep = ""),
                      to = paste(new.location, "/", y, sep = ""),
                      overwrite = FALSE, recursive = FALSE)
          })
        }
      })
    }
  }
  # list all directories below the working directory
  dirsIn <- list.dirs(full.names = TRUE, recursive = TRUE)
  # remove the first element, which is the working directory itself
  dirsIn <- dirsIn[-1]
  # Create a directory into which all relevant files will be written
  dir.create(new.location, showWarnings = FALSE)
  # run cpmvFILE, assigning the result to x to prevent printout
  x <- cpmvFILE(dirs = dirsIn, patterns = ptn)
  # remove x
  rm(x)
}
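A typical call might look like this (the paths are hypothetical, and fileSort is assumed to have been sourced as defined above):

```r
# gather every genepop (.gen) file found anywhere below the working
# directory into a single folder
setwd("~/simulations")
fileSort(patterns = "\\.gen$", new.location = "~/all_genepop_files")
```

Passing an anchored regular expression like "\\.gen$" avoids accidentally matching files that merely contain "gen" somewhere in their names.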
divBasic(infile = NULL, outfile = NULL, gp = 3)
divOnline() # only works if diveRsity is loaded
# or
diveRsity::divOnline() # works if diveRsity is not loaded
div.part(infile = NULL, outfile = NULL, gp = 3,
pairwise = FALSE, WC_Fst = FALSE,
bs_locus = FALSE, bs_pairwise = FALSE,
bootstraps = 0, Plot = FALSE, parallel = FALSE)
Error in if(pTotal == 0){ : missing value where TRUE/FALSE needed
if(pTotal == 0){
afTab <- function(x){
  apply(x, 2, table)
}
afTab <- function(x){
  lapply(seq_len(ncol(x)), function(i){
    return(table(x[, i]))
  })
}