
Bioinformatics. 2013 Nov 15;29(22):2877-83. doi: 10.1093/bioinformatics/btt480. Epub 2013 Aug 19.

A new statistic for identifying batch effects in high-throughput genomic data that uses guided principal component analysis.

Sarah E Reese, Kellie J Archer, Terry M Therneau, Elizabeth J Atkinson, Celine M Vachon, Mariza de Andrade, Jean-Pierre A Kocher, Jeanette E Eckel-Passow

Affiliations

  1. Department of Biostatistics, Biostatistics Shared Resource Core, VCU Massey Cancer Center, Virginia Commonwealth University, Richmond, VA 23284, USA; Division of Biomedical Statistics and Informatics and Division of Epidemiology, Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55905, USA.

PMID: 23958724 PMCID: PMC3810845 DOI: 10.1093/bioinformatics/btt480

Abstract

MOTIVATION: Batch effects are due to probe-specific systematic variation between groups of samples (batches) resulting from experimental features that are not of biological interest. Principal component analysis (PCA) is commonly used as a visual tool to determine whether batch effects exist after applying a global normalization method. However, PCA yields linear combinations of the variables that contribute maximum variance and thus will not necessarily detect batch effects if they are not the largest source of variability in the data.
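The limitation described above can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not data from the article: the sample sizes, the `factor` signal, the probe ranges and the shift size are all hypothetical choices made for illustration. A strong signal that is balanced across batches dominates the first principal component, so the weaker batch shift only appears in a later component.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 50                      # samples, probes (hypothetical sizes)
batch = np.repeat([0, 1], n // 2)   # two batches of 50 samples each

# Hypothetical dominant signal, balanced across batches by construction:
# every factor value occurs once in batch 0 and once in batch 1.
factor = np.tile(rng.normal(size=n // 2), 2)
loadings = np.zeros(p)
loadings[:10] = 3.0                 # the dominant signal affects probes 0-9

X = rng.normal(size=(n, p)) + np.outer(factor, loadings)
X[batch == 1, 40:] += 1.5           # weaker batch shift on probes 40-49

# Ordinary (unguided) PCA via SVD of the column-centered data.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                      # principal component scores, one column per PC
explained = s**2 / np.sum(s**2)     # proportion of variance per PC

# Absolute correlation between each of the first three PCs and the batch label.
batch_corr = [abs(np.corrcoef(scores[:, k], batch)[0, 1]) for k in range(3)]
for k, (ev, r) in enumerate(zip(explained[:3], batch_corr), start=1):
    print(f"PC{k}: variance explained = {ev:.2f}, |corr with batch| = {r:.2f}")
```

In this setup a plot of the first component alone would show little batch separation, even though a batch effect is present in the data.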

RESULTS: We present an extension of PCA to quantify the existence of batch effects, called guided PCA (gPCA). We describe a test statistic that uses gPCA to test whether a batch effect exists. We apply the gPCA-derived test statistic to simulated data and to two copy number variation case studies: the first comprised 614 samples from a breast cancer family study using Illumina Human 660 bead-chip arrays, and the second comprised 703 samples from a family blood pressure study that used Affymetrix SNP Array 6.0. We demonstrate that the statistic has good statistical properties and identifies significant batch effects in both case studies.
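The abstract does not give the form of the statistic, so the following is only a sketch under stated assumptions, not the gPCA R package's implementation. It assumes that the guided step is a singular value decomposition of Y^T X, where X is the column-centered data matrix and Y is the sample-by-batch indicator matrix, and that the statistic is the variance of the first guided component divided by the variance of the first unguided component. The permutation p-value uses the common (1 + count) / (1 + permutations) convention, which is also an assumption. The function names `gpca_delta` and `gpca_permutation_test` are hypothetical.

```python
import numpy as np

def gpca_delta(X, batch):
    """Ratio of variance along the first guided PC to variance along the
    first unguided PC (assumed form of the statistic, see lead-in)."""
    Xc = X - X.mean(axis=0)                                # center each probe
    labels = np.unique(batch)
    Y = (batch[:, None] == labels[None, :]).astype(float)  # n x b batch indicator
    # Unguided PCA: right singular vectors of the centered data.
    _, _, Vt_u = np.linalg.svd(Xc, full_matrices=False)
    # Guided PCA (assumed): right singular vectors of Y^T X.
    _, _, Vt_g = np.linalg.svd(Y.T @ Xc, full_matrices=False)
    var_guided = np.var(Xc @ Vt_g[0], ddof=1)
    var_unguided = np.var(Xc @ Vt_u[0], ddof=1)
    return var_guided / var_unguided

def gpca_permutation_test(X, batch, n_perm=200, seed=0):
    """Permutation p-value: how often shuffled batch labels give a ratio
    at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    observed = gpca_delta(X, batch)
    permuted = np.array([gpca_delta(X, rng.permutation(batch))
                         for _ in range(n_perm)])
    p_value = (1 + np.sum(permuted >= observed)) / (n_perm + 1)
    return observed, p_value

# Usage on synthetic data with a strong batch shift (hypothetical values).
rng = np.random.default_rng(1)
batch = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 20))
X[batch == 1] += 3.0
delta, p_value = gpca_permutation_test(X, batch)
print(f"delta = {delta:.3f}, permutation p-value = {p_value:.3f}")
```

Because the first unguided component maximizes variance over all unit directions, the ratio cannot exceed 1 under these assumptions; values near 1 suggest that batch is aligned with the dominant source of variation.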

CONCLUSION: We developed a new statistic that uses gPCA to identify whether batch effects exist in high-throughput genomic data. Although our examples pertain to copy number data, gPCA is general and can be used on other data types as well.

AVAILABILITY AND IMPLEMENTATION: The gPCA R package (available via CRAN) provides functionality and data to perform the methods in this article.

CONTACT: [email protected]

