Diversity in Life
  • Home
  • Diversity in Life Blog
  • Software
  • diveRsity online
  • R-peer-grp

Batch processing genepop files with diveRsity.

3/31/2013

0 Comments

 
By Kevin Keenan
Picture
A few months back I was sent an email by a useR of the the diveRsity package, asking a question which involved calculating mean pairwise parameters from hundreds of separate genepop files. It was an interesting problem that I had never considered before. The more I thought about the problem, the more I realised how important this task might be in simulation based studies such as Approximate Bayesian Computation (ABC).

These tasks generally involve the simulation of hundreds, thousands or even millions of data sets, and the subsequent estimation of a range of summary statistics from each of these. Depending on the number of data sets involved, the computational effort might be massive. 

In the case of the emailer, parallelisation seemed like a fast solution to the problem, which would allow the tasked to be scaled from a desktop to a cluster environment if needs be. The below psudo-code demonstrates the use of the 'divPart' function to calculate global diversity statistics for imaginary batches of files (e.g. assume the files being analyses are separated into 10 separate folders which represent 10 separate simulation scenarios perhaps, with each folder containing 1000 genepop files), using R.
  ########################################################### 
# Running divPart on batches of genepop files in parallel #
###########################################################
# load the diveRsity package
library("diveRsity")

# specify the top level directory under which all 'simulation'
# folders are located, by setting the working directory to it.

setwd("~/simulations")

# The directory tree might look like this:

simulations
|
---- simulation-1
| |
| ---- file1.gen
| |
| ---- file2.gen
| |
| ---- file3.gen
| ---- simulation-2
| |
| ---- file1.gen
| |
| ---- file2.gen
| |
| ---- file3.gen
| ---- simulation-3
| |
| ---- file1.gen
| |
| ---- file2.gen
| |
| ---- file3.gen

# and so on. # Next we can specify the names of our simulation folders in two ways
# manually
fold_names <- paste("simulation", 1:10, sep = "-")

# automatically (when there is only a single level below the
# top directory)
fold_names <- list.dirs(full.names = TRUE, recursive = FALSE)

# Now we can determine the names of all genepop files in each folder

file_names <- lapply(fold_names, function(x){
files <- dir(path = paste(x, "/", sep = ""), pattern = "*.gen",
full.names = TRUE)
return(files) })

# file_names will be a list of length 10. Each element will contain
# the names of all .gen files within the respective simulation folder

# Before we are ready to run the main analyses, we should set up
# the parallel environment

# load the doParallel package
library("doParallel")
# set up a cluster of 10 CPUs (one for each batch of files)
cl <- makeCluster(10)
# Export the 'divPart' function to the cluster cores
clusterExport(cl, "divPart", envir = environment())

# Now we can run the main analyses

results <- parLapply(cl, file_names, function(x){
sim_res <- sapply(x, function(y){
out <- divPart(infile = y, gp = 3, WC_Fst = TRUE)
return(out$estimate[nrow(out$estimate), 4:7])
})
return(t(sim_res)) # transpose sim_res
})

# This will generate a list (of length 10), with each element containing
# a matrix of 1000 rows (1 per file) and 4 columns (1 for each diversity
# statistic
# example of output for simulation 1
G_st_est G_hed_st_est D_Jost_est Fst_WC
0.3905 0.8938 0.8256 0.4010
0.5519 0.8719 0.6986 0.6031
0.5924 0.8880 0.7092 0.6096
... ... ... ...
... ... ... ...

# these results could then be piped to further analyses or visualisation
# tools
If any of the command usage here is not clear or incorrect, please don't hesitate to contact me (either by commenting or via email). You will find my email in the diveRsity package documentation.
0 Comments



Leave a Reply.

    Tweet

    About the authors

    Kevin Keenan is currently working towards a PhD in population and evolutionary genetics.  
    His general research interests include; small scale intraspecific genetic divergence, speciation and phylogeography.

    Uncle Mick is currently reading genetics at the University of Manchester.
    His interests include; the role of non-coding RNA in regulatory processes, molecular processes associated with sequencing technology, and the interactions between genotype and phenotype.

    Categories

    All
    Analysis
    Ancient Dna
    Cell Biology
    Conservation
    Evolution
    Fish Genetics
    Northern Ireland
    Programming
    R
    Religion

    Blogroll

    R-bloggers

    Archives

    January 2014
    October 2013
    September 2013
    May 2013
    April 2013
    March 2013
    February 2013
    January 2013
    December 2012
    October 2012
    August 2012
    July 2012

    RSS Feed

Powered by Create your own unique website with customizable templates.