HGG Adv. 2021 Apr 08;2(2). doi: 10.1016/j.xhgg.2020.100019. Epub 2021 Jan 05.

Transcriptome prediction performance across machine learning models and diverse ancestries.

HGG advances

Paul C Okoro, Ryan Schubert, Xiuqing Guo, W Craig Johnson, Jerome I Rotter, Ina Hoeschele, Yongmei Liu, Hae Kyung Im, Amy Luke, Lara R Dugas, Heather E Wheeler

Affiliations

  1. Program in Bioinformatics, Loyola University Chicago, Chicago, IL, USA.
  2. Department of Mathematics and Statistics, Loyola University Chicago, Chicago, IL, USA.
  3. Institute for Translational Genomics and Population Sciences, The Lundquist Institute and Department of Pediatrics at Harbor-UCLA Medical Center, Torrance, CA, USA.
  4. Department of Biostatistics, University of Washington, Seattle, WA, USA.
  5. Fralin Life Sciences Institute, Virginia Tech, Blacksburg, VA, USA.
  6. Department of Statistics, Virginia Tech, Blacksburg, VA, USA.
  7. Wake Forest School of Medicine, Winston-Salem, NC, USA.
  8. Department of Medicine, Duke University School of Medicine, Durham, NC, USA.
  9. Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, IL, USA.
  10. Department of Public Health Sciences, Parkinson School of Health Sciences and Public Health, Loyola University Chicago, Maywood, IL, USA.
  11. Department of Human Biology, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa.
  12. Department of Biology, Loyola University Chicago, Chicago, IL, USA.
  13. Department of Computer Science, Loyola University Chicago, Chicago, IL, USA.

PMID: 33937878 PMCID: PMC8087249 DOI: 10.1016/j.xhgg.2020.100019

Abstract

Transcriptome prediction methods such as PrediXcan and FUSION have become popular in complex trait mapping. Most transcriptome prediction models have been trained in European populations using methods that make parametric linear assumptions, such as the elastic net (EN). To potentially further optimize imputation performance of gene expression across global populations, we built transcriptome prediction models using both linear and non-linear machine learning (ML) algorithms and evaluated their performance in comparison to EN. We trained models using genotype and blood monocyte transcriptome data from the Multi-Ethnic Study of Atherosclerosis (MESA), comprising individuals of African, Hispanic, and European ancestries, and tested them using genotype and whole-blood transcriptome data from the Modeling the Epidemiology Transition Study (METS), comprising individuals of African ancestries. We show that prediction performance is highest when the training and testing populations share similar ancestries, regardless of the prediction algorithm used. While EN generally outperformed random forest (RF), support vector regression (SVR), and K nearest neighbor (KNN), we found that RF outperformed EN for some genes, particularly between disparate ancestries, suggesting potential robustness and reduced variability of RF imputation performance across global populations. In an application to a high-density lipoprotein (HDL) phenotype, we show that including RF prediction models in PrediXcan revealed potential gene associations missed by EN models. Therefore, by integrating other ML models into PrediXcan and diversifying our training populations to include more global ancestries, we may uncover new genes associated with complex traits.
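As an illustration of the general workflow the abstract describes (train a model on genotypes to predict expression, then score it in held-out individuals), here is a minimal sketch of one of the named algorithms, KNN regression, for a single gene. It is not the authors' pipeline. The data are simulated rather than drawn from MESA or METS, and the distance measure (squared Euclidean), the value k = 5, the train/test split, and the use of Pearson correlation as the performance measure are all assumptions made for the example.

```python
import math
import random


def knn_predict(train_x, train_y, query, k):
    """Predict expression for one genotype vector as the mean
    expression of its k nearest training individuals."""
    # Squared Euclidean distance between dosage vectors (assumed metric).
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, query)), i)
        for i, row in enumerate(train_x)
    )
    nearest = [i for _, i in dists[:k]]
    return sum(train_y[i] for i in nearest) / len(nearest)


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def simulate(n, n_snps, rng):
    """Simulate genotype dosages (0/1/2) and one gene's expression.
    The effect sizes and noise level are arbitrary illustration values."""
    weights = [0.8, -0.5, 0.3] + [0.0] * (n_snps - 3)
    genotypes = [[rng.choice((0, 1, 2)) for _ in range(n_snps)]
                 for _ in range(n)]
    expression = [sum(w * g for w, g in zip(weights, row)) + rng.gauss(0, 0.5)
                  for row in genotypes]
    return genotypes, expression


rng = random.Random(0)
x, y = simulate(200, 10, rng)

# Hypothetical split: first 150 individuals for training, last 50 for testing.
train_x, train_y = x[:150], y[:150]
test_x, test_y = x[150:], y[150:]

predicted = [knn_predict(train_x, train_y, q, k=5) for q in test_x]
r = pearson(predicted, test_y)
print(f"test-set Pearson r = {r:.3f}")
```

Here the training and test individuals come from the same simulated population. The cross-ancestry comparison reported in the abstract would correspond to drawing the two sets from populations with different allele frequencies, which this sketch does not attempt.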

Conflict of interest statement

Declaration of interests: The authors declare no competing interests.

