diveRsity v1.7.6 update: The problem of bootstrapping bias

1/27/2014

By Kevin Keenan

A new version of diveRsity is now available for download from 'CRAN'. It can also be downloaded using the method described here. While most of the updates in this version are minor, some are more substantial. One of these is the inclusion of an automated bias correction for confidence intervals (CIs) estimated for pairwise differentiation. Below is a coded demonstration of the problem of bias when estimating confidence intervals for diversity/differentiation statistics from bootstrapped data.

The code

Load diveRsity and calculate pairwise differentiation, including \( 95\% \) Confidence Intervals (using 1000 bootstrap iterations).

 # load diveRsity library(diveRsity)
# run fastDivPart on "Test_data" (1000 bootstraps)
data(Test_data)
system.time({
  pwDiff <- fastDivPart(Test_data, pairwise = TRUE,
                        bs_pairwise = TRUE, WC_Fst = TRUE,
                        bootstrap = 1000, parallel = TRUE)
})

    user  system elapsed 
  2.810   0.054 456.167

Since Test_data contains genotype data for six population samples, fastDivPart will calculate confidence intervals for 15 separate pairwise estimates of \( G_{ST} \) (Nei & Chesser, 1983), \( G^{'}_{ST} \) (Hedrick, 2005), \( D_{Jost} \) (Jost, 2008) and \( \theta \) (Weir & Cockerham, 1984). We can plot these point estimates along with CIs as follows (N.B. by setting the argument plot = TRUE in the original call, these results will be plotted automatically).

Plotting the results

 # View the pairwise results data
names(pwDiff)

 [1] "standard"     "estimate"     "pairwise"     "meanPairwise" "bs_pairwise"

We can see that the pairwise bootstrap results are located at bs_pairwise. This object within pwDiff contains separate data tables for each of the four statistics listed above. All data tables have the same structure, so we can just look at one, \( D_{Jost} \).

 pwDiff$bs_pairwise$djostEst

                       actual      mean    BC_mean Lower_95%CI Upper_95%CI BC_Lower_95%CI BC_Upper_95%CI
POP1_1 vs. POP2_1  0.6002342 0.5850703  0.6002342   5.169e-01    0.647625      0.5320938      0.6627891
POP1_1 vs. POP3_1  0.0006103 0.0011152  0.0006103   4.645e-06    0.003614     -0.0005003      0.0031092
POP1_1 vs. POP4_1 -0.0002870 0.0002585 -0.0002870  -7.567e-05    0.001304     -0.0006211      0.0007581
POP2_1 vs. POP3_1  0.5903311 0.5742200  0.5903311   5.086e-01    0.635704      0.5246634      0.6518151
POP2_1 vs. POP4_1  0.5933083 0.5783549  0.5933083   5.139e-01    0.641400      0.5288839      0.6563533
POP3_1 vs. POP4_1  0.0010899 0.0018571  0.0010899   3.038e-04    0.004484     -0.0004634      0.0037167

We can see that the output for each of the four statistics estimates contains seven data columns. We are interested in comparing the difference between the standard CIs and the bias corrected CIs.

 # plot using ggplot2
library(ggplot2)
# create a djost dataframe from our pwDiff results to reduce typing
djost <- as.data.frame(pwDiff$bs_pairwise$djostEst)
# The column names cause problems, so we can change them as follows:
colnames(djost) <- c("actual", "mean", "bcMean", "bucLow", "bucUp", "bcLow",                       "bcUp")
# plot the point estimates for each of your four pw comparisons
# with uncorrected and corrected 95% CIs
CI_type <- as.factor(c(rep("uc", nrow(djost)), rep("bc", nrow(djost))))
ggplot(djost, aes(x = c(1:nrow(djost),1:nrow(djost) + 0.5), 
                  y = c(actual, actual), colour = CI_type)) +
  geom_errorbar(aes(ymin = c(bucLow, bcLow), ymax = c(bucUp, bcUp)), 
                width = 0.3) +
  geom_point() +
  ylab(expression(D["Jost"])) +
  xlab("Pairwise Comparison")

plot of chunk plt3

A figure demonstrating the difference in bias corrected and uncorrected \( 95\% CIs \) derived from 1000 bootstrap iterations. Blue lines represent the uncorrected CI, while pink lines represent the bias corrected CIs.

Of particular note is the fact that often the uncorrected CIs don't even encompass the sample estimates. This occurs as a result of the known upward bias associated with bootstrapping these types of statistics. Another important point of notes is the increased risk of type I error seen for two of the pairwise comparisons when using uncorrected CIs. In these comparisons, the uncorrected CIs do not overlap \( 0 \), suggesting differentiation between the two populations is statistically significant. However, following bias correction, these pairwise differentiation estimates are no longer significantly different from \( 0 \) (i.e. they overlap \( 0 \)).

From this example, it is clear that correcting for bias in bootstrapped differentiation estimates is important. In line with this, diveRsity now provides automatic bias correction in the fastDivPart function.

N.B. divPart does not have this capability since it is no longer supported.

References

Hedrick, P. (2005) A standardized genetic diﬀerentiation measure. Evolution, 59, 1633–1638
Jost, L. (2008) GST and its relatives do not measure diﬀerentiation. Molecular Ecology, 17, 4015–4026
Jost, L. (2008) GST and its relatives do not measure diﬀerentiation. Molecular Ecology, 17, 4015–4026
Keenan, K., McGinnity, P., Cross, T. F., Crozier, W. W., Prodöhl, P. A. (2013), diveRsity: An R package for the estimation and exploration of population genetics parameters and their associated errors. Methods in Ecology and Evolution, 4: 782–788. doi: 10.1111/2041-210X.12067
Weir, B. & Cockerham, C. (1984) Estimating F-statistics for the analysis of population structure. Evolution, 38, 1358–1370

1 Comment