These tasks generally involve simulating hundreds, thousands, or even millions of data sets, and then estimating a range of summary statistics from each of them. Depending on the number of data sets involved, the computational effort can be substantial.
In the emailer's case, parallelisation seemed like a fast solution to the problem, and one that would allow the task to be scaled from a desktop to a cluster environment if need be. The example code below demonstrates the use of the 'divPart' function to calculate global diversity statistics for imaginary batches of files in R (e.g. assume the files being analysed are separated into 10 folders, each representing a separate simulation scenario and containing 1000 genepop files).
###########################################################
# Running divPart on batches of genepop files in parallel #
###########################################################
# load the diveRsity package
library("diveRsity")
# specify the top level directory under which all 'simulation'
# folders are located, by setting the working directory to it.
setwd("~/simulations")
# The directory tree might look like this:
simulations
|
---- simulation-1
|       |
|       ---- file1.gen
|       |
|       ---- file2.gen
|       |
|       ---- file3.gen
|
---- simulation-2
|       |
|       ---- file1.gen
|       |
|       ---- file2.gen
|       |
|       ---- file3.gen
|
---- simulation-3
        |
        ---- file1.gen
        |
        ---- file2.gen
        |
        ---- file3.gen
# and so on.
# Next, we can specify the names of our simulation folders in two ways:
# manually
fold_names <- paste("simulation", 1:10, sep = "-")
# automatically (when there is only a single level below the
# top directory)
fold_names <- list.dirs(full.names = TRUE, recursive = FALSE)
# Now we can determine the names of all genepop files in each folder
file_names <- lapply(fold_names, function(x){
  # dir()'s 'pattern' argument is a regular expression, so match a
  # literal ".gen" at the end of the file name
  dir(path = x, pattern = "\\.gen$", full.names = TRUE)
})
# file_names will be a list of length 10. Each element will contain
# the names of all .gen files within the respective simulation folder
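A quick way to sanity-check this step without real genepop files is to build a throwaway tree in a temporary directory. All folder and file names below are placeholders, and the files are empty; the point is only to confirm that the lapply/dir pattern returns one character vector per folder.

```r
# Build a dummy tree: 3 "simulation" folders, each with 5 empty .gen files
top <- file.path(tempdir(), "simulations")
for (i in 1:3) {
  fold <- file.path(top, paste("simulation", i, sep = "-"))
  dir.create(fold, recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(fold, paste0("file", 1:5, ".gen")))
}
# Same pattern as in the main example, pointed at the dummy tree
fold_names <- list.dirs(top, full.names = TRUE, recursive = FALSE)
file_names <- lapply(fold_names, function(x) {
  dir(path = x, pattern = "\\.gen$", full.names = TRUE)
})
length(file_names)   # 3 (one element per folder)
lengths(file_names)  # 5 5 5 (one name per .gen file)
```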
# Before we are ready to run the main analyses, we should set up
# the parallel environment
# load the doParallel package
library("doParallel")
# set up a cluster of 10 CPUs (one for each batch of files)
cl <- makeCluster(10)
# Make the 'divPart' function available on the worker processes by
# loading the diveRsity package on each of them
clusterEvalQ(cl, library("diveRsity"))
# Now we can run the main analyses
results <- parLapply(cl, file_names, function(x){
sim_res <- sapply(x, function(y){
out <- divPart(infile = y, gp = 3, WC_Fst = TRUE)
return(out$estimate[nrow(out$estimate), 4:7])
})
return(t(sim_res)) # transpose sim_res
})
# shut down the cluster workers now that the analyses are finished
stopCluster(cl)
# This will generate a list (of length 10), with each element containing
# a matrix of 1000 rows (1 per file) and 4 columns (1 per diversity
# statistic)
# example of output for simulation 1
G_st_est G_hed_st_est D_Jost_est Fst_WC
0.3905 0.8938 0.8256 0.4010
0.5519 0.8719 0.6986 0.6031
0.5924 0.8880 0.7092 0.6096
... ... ... ...
... ... ... ...
# these results could then be piped to further analyses or visualisation
# tools
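As a sketch of such downstream processing, the per-simulation matrices can be stacked into a single data frame and summarised. Mock matrices stand in for real divPart output here (3 simulations of 4 files each, filled with random values), with column names taken from the example output above.

```r
# Mock 'results': a list of matrices shaped like the divPart output
set.seed(1)
stat_names <- c("G_st_est", "G_hed_st_est", "D_Jost_est", "Fst_WC")
results <- lapply(1:3, function(i) {
  matrix(runif(4 * 4), nrow = 4, ncol = 4,
         dimnames = list(NULL, stat_names))
})
# Stack into one data frame, tagging each row with its simulation number
res_df <- do.call(rbind, lapply(seq_along(results), function(i) {
  data.frame(simulation = i, results[[i]])
}))
# Per-simulation means of each statistic, ready for plotting or testing
aggregate(. ~ simulation, data = res_df, FUN = mean)
```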