These tasks generally involve simulating hundreds, thousands, or even millions of data sets, and then estimating a range of summary statistics from each of them. Depending on the number of data sets involved, the computational effort can be substantial.
In the emailer's case, parallelisation seemed like a fast solution to the problem, and one that would allow the task to be scaled from a desktop to a cluster environment if need be. The example code below demonstrates the use of the 'divPart' function to calculate global diversity statistics for imaginary batches of files in R (e.g. assume the files being analysed are separated into 10 folders, each representing a separate simulation scenario and containing 1000 genepop files).
###########################################################
# Running divPart on batches of genepop files in parallel #
###########################################################
# load the diveRsity package
library("diveRsity")
# specify the top level directory under which all 'simulation'
# folders are located, by setting the working directory to it.
setwd("~/simulations")
# The directory tree might look like this:
simulations
|
---- simulation-1
|       |
|       ---- file1.gen
|       |
|       ---- file2.gen
|       |
|       ---- file3.gen
|
---- simulation-2
|       |
|       ---- file1.gen
|       |
|       ---- file2.gen
|       |
|       ---- file3.gen
|
---- simulation-3
        |
        ---- file1.gen
        |
        ---- file2.gen
        |
        ---- file3.gen
# and so on.
# Next, we can specify the names of our simulation folders in two ways:
# manually
fold_names <- paste("simulation", 1:10, sep = "-")
# automatically (when there is only a single level below the
# top directory)
fold_names <- list.dirs(full.names = TRUE, recursive = FALSE)
# Now we can determine the names of all genepop files in each folder
file_names <- lapply(fold_names, function(x){
  # dir()'s 'pattern' argument is a regular expression, so match a
  # literal ".gen" at the end of the file name
  dir(path = x, pattern = "\\.gen$", full.names = TRUE)
})
# file_names will be a list of length 10. Each element will contain
# the names of all .gen files within the respective simulation folder
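A quick way to sanity-check this step without real genepop files is to build a throwaway tree in a temporary directory. All folder and file names below are placeholders, and the files are empty; the point is only to confirm that the lapply/dir pattern returns one character vector per folder.

```r
# Build a dummy tree: 3 "simulation" folders, each with 5 empty .gen files
top <- file.path(tempdir(), "simulations")
for (i in 1:3) {
  fold <- file.path(top, paste("simulation", i, sep = "-"))
  dir.create(fold, recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(fold, paste0("file", 1:5, ".gen")))
}
# Same pattern as in the main example, pointed at the dummy tree
fold_names <- list.dirs(top, full.names = TRUE, recursive = FALSE)
file_names <- lapply(fold_names, function(x) {
  dir(path = x, pattern = "\\.gen$", full.names = TRUE)
})
length(file_names)   # 3 (one element per folder)
lengths(file_names)  # 5 5 5 (one name per .gen file)
```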
# Before we are ready to run the main analyses, we should set up
# the parallel environment
# load the doParallel package
library("doParallel")
# set up a cluster of 10 CPUs (one for each batch of files)
cl <- makeCluster(10)
# Make the 'divPart' function available on the worker processes by
# loading the diveRsity package on each of them
clusterEvalQ(cl, library("diveRsity"))
# Now we can run the main analyses
results <- parLapply(cl, file_names, function(x){
sim_res <- sapply(x, function(y){
out <- divPart(infile = y, gp = 3, WC_Fst = TRUE)
return(out$estimate[nrow(out$estimate), 4:7])
})
return(t(sim_res)) # transpose sim_res
})
# shut down the cluster workers now that the analyses are finished
stopCluster(cl)
# This will generate a list (of length 10), with each element containing
# a matrix of 1000 rows (1 per file) and 4 columns (1 per diversity
# statistic)
# example of output for simulation 1
G_st_est G_hed_st_est D_Jost_est Fst_WC
0.3905 0.8938 0.8256 0.4010
0.5519 0.8719 0.6986 0.6031
0.5924 0.8880 0.7092 0.6096
... ... ... ...
... ... ... ...
# these results could then be piped to further analyses or visualisation
# tools
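As a sketch of such downstream processing, the per-simulation matrices can be stacked into a single data frame and summarised. Mock matrices stand in for real divPart output here (3 simulations of 4 files each, filled with random values), with column names taken from the example output above.

```r
# Mock 'results': a list of matrices shaped like the divPart output
set.seed(1)
stat_names <- c("G_st_est", "G_hed_st_est", "D_Jost_est", "Fst_WC")
results <- lapply(1:3, function(i) {
  matrix(runif(4 * 4), nrow = 4, ncol = 4,
         dimnames = list(NULL, stat_names))
})
# Stack into one data frame, tagging each row with its simulation number
res_df <- do.call(rbind, lapply(seq_along(results), function(i) {
  data.frame(simulation = i, results[[i]])
}))
# Per-simulation means of each statistic, ready for plotting or testing
aggregate(. ~ simulation, data = res_df, FUN = mean)
```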