PeerJ. 2016 Feb 08;4:e1603. doi: 10.7717/peerj.1603. eCollection 2016.

PhyloPythiaS+: a self-training method for the rapid reconstruction of low-ranking taxonomic bins from metagenomes.

Ivan Gregor, Johannes Dröge, Melanie Schirmer, Christopher Quince, Alice C McHardy

Affiliations

  1. Max-Planck Research Group for Computational Genomics and Epidemiology, Max-Planck Institute for Informatics, Saarbrücken, Germany; Department of Algorithmic Bioinformatics, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany; Computational Biology of Infection Research, Helmholtz Center for Infection Research, Braunschweig, Germany.
  2. The Broad Institute of MIT and Harvard, Cambridge, MA, United States.
  3. School of Engineering, University of Glasgow, Glasgow, United Kingdom.

PMID: 26870609 PMCID: PMC4748697 DOI: 10.7717/peerj.1603

Abstract

Background. Metagenomics is an approach for characterizing environmental microbial communities in situ; it allows their functional and taxonomic characterization and the recovery of sequences from uncultured taxa. This is often achieved by a combination of sequence assembly and binning, where sequences are grouped into 'bins' representing taxa of the underlying microbial community. Assignment to low-ranking taxonomic bins is an important challenge for binning methods, as is scalability to Gb-sized datasets generated with deep sequencing techniques. One of the best available methods for recovering species bins from deep-branching phyla is the expert-trained PhyloPythiaS package, where a human expert decides on the taxa to incorporate in the model and identifies 'training' sequences based on marker genes directly from the sample. Due to the manual effort involved, this approach does not scale to multiple metagenome samples and requires substantial expertise, which researchers who are new to the area do not have.

Results. We have developed PhyloPythiaS+, a successor to our PhyloPythia(S) software. The new (+) component performs the work previously done by the human expert. PhyloPythiaS+ also includes a new k-mer counting algorithm, which accelerated the simultaneous counting of 4-6-mers used for taxonomic binning 100-fold and reduced the overall execution time of the software by a factor of three. Our software allows the analysis of Gb-sized metagenomes with inexpensive hardware and the recovery of species- or genus-level bins with low error rates in a fully automated fashion. PhyloPythiaS+ was compared to MEGAN, taxator-tk, Kraken and the generic PhyloPythiaS model. The results showed that PhyloPythiaS+ performs especially well for samples originating from novel environments in comparison to the other methods.

Availability. PhyloPythiaS+ in a virtual machine is available for installation under Windows, Unix systems or OS X at: https://github.com/algbioi/ppsp/wiki.

Keywords: Bioinformatics; Machine learning; Metagenomics; Taxonomic classification
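
The abstract mentions simultaneous counting of 4-6-mers but does not describe the algorithm itself. The following is only a minimal illustrative sketch of one common way to count several k-mer lengths in a single pass, using a rolling 2-bit encoding; it is not the PhyloPythiaS+ implementation. The names count_kmers and decode, and the choice to reset the window at ambiguous bases such as N, are assumptions made for this example.

```python
from collections import Counter

# 2-bit encoding of the four unambiguous nucleotides (illustrative choice).
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_kmers(seq, ks=(4, 5, 6)):
    """Count all k-mers for every k in ks in one pass over seq.

    Hypothetical helper, not taken from PhyloPythiaS+. K-mers are stored
    as integers; windows containing a non-ACGT character are skipped.
    """
    kmax = max(ks)
    masks = {k: (1 << (2 * k)) - 1 for k in ks}
    counts = {k: Counter() for k in ks}
    code = 0   # rolling code of the last kmax bases
    valid = 0  # consecutive unambiguous bases ending at this position
    for base in seq.upper():
        val = ENC.get(base)
        if val is None:
            # Ambiguous base: restart the rolling window.
            code, valid = 0, 0
            continue
        code = ((code << 2) | val) & masks[kmax]
        valid += 1
        for k in ks:
            if valid >= k:
                # The low 2*k bits hold the k most recent bases.
                counts[k][code & masks[k]] += 1
    return counts


def decode(code, k):
    """Turn an integer k-mer code back into its nucleotide string."""
    return "".join("ACGT"[(code >> (2 * (k - 1 - i))) & 3] for i in range(k))


if __name__ == "__main__":
    result = count_kmers("ACGTACGTNACGT")
    for k in sorted(result):
        print(k, {decode(c, k): n for c, n in result[k].items()})
```

Because each base updates one shared integer, the counts for every k are obtained from a single scan of the sequence rather than one scan per k. Whether the published speed-up relies on a similar idea cannot be determined from the abstract.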
