Front Genet. 2021 Aug 05;12:659650. doi: 10.3389/fgene.2021.659650. eCollection 2021.

Unraveling City-Specific Microbial Signatures and Identifying Sample Origins for the Data From CAMDA 2020 Metagenomic Geolocation Challenge.

Frontiers in genetics

Runzhi Zhang, Dorothy Ellis, Alejandro R Walker, Susmita Datta

Affiliations

  1. Department of Biostatistics, University of Florida, Gainesville, FL, United States.
  2. Department of Oral Biology, University of Florida, Gainesville, FL, United States.

PMID: 34421984 PMCID: PMC8375386 DOI: 10.3389/fgene.2021.659650

Abstract

The composition of microbial communities is known to be location-specific. Investigating microbial composition across different cities makes it possible to unravel city-specific microbial signatures and to predict the origin of unknown samples. As part of the CAMDA 2020 Metagenomic Geolocation Challenge, MetaSUB provided whole genome shotgun (WGS) metagenomics data from samples across 28 cities, along with non-microbial city data for 23 of these cities. In our solution to this challenge, we implemented feature selection, normalization, clustering, and three machine learning methods to classify the cities based on their microbial compositions. Of the three methods, the multilayer perceptron performed best, with an error rate of 19.60% on the test data contained in the main dataset, where a prediction counted as correct if the true city received the highest or second-highest number of votes. We then trained the model to predict the origins of samples from the mystery dataset by including these samples with the additional group label "mystery." The mystery dataset comprised samples collected from a subset of the cities in the main dataset as well as samples collected from new cities. For samples from cities that belonged to the main dataset, error rates ranged from 18.18% to 72.7%. For samples from new cities that did not belong to the main dataset, 57.7% of the test samples could be correctly labeled as "mystery" samples. Furthermore, using a multi-output multilayer perceptron algorithm, we predicted some of the non-microbial features for the mystery samples from cities that did not belong to the main dataset, in order to draw inferences and narrow the range of possible sample origins.
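The error criterion described in the abstract (a prediction counts as correct when the true city has the highest or second-highest number of votes) can be illustrated with a minimal sketch. This is not the authors' code: the function name `top2_error_rate`, the representation of votes as a list of predicted labels per sample, and the example data are all assumptions made for illustration, and the abstract does not say how the votes were produced or how ties were broken.

```python
from collections import Counter


def top2_error_rate(votes_per_sample, true_labels):
    """Fraction of samples whose true label is NOT among the two most-voted labels.

    votes_per_sample: one list of predicted labels (votes) per sample (assumed format).
    true_labels: the true label of each sample, in the same order.
    """
    if len(votes_per_sample) != len(true_labels):
        raise ValueError("votes_per_sample and true_labels must have equal length")
    if not true_labels:
        raise ValueError("at least one sample is required")

    misses = 0
    for votes, truth in zip(votes_per_sample, true_labels):
        # most_common(2) returns the two labels with the most votes;
        # ties are resolved by first occurrence, which is an arbitrary choice here.
        top_two = [label for label, _ in Counter(votes).most_common(2)]
        if truth not in top_two:
            misses += 1
    return misses / len(true_labels)


# Hypothetical votes for four samples; labels "A", "B", "C" stand in for cities,
# and "mystery" mirrors the extra group label mentioned in the abstract.
example_votes = [
    ["A", "A", "B", "C"],            # true "A": most votes, counted as correct
    ["B", "B", "C", "A", "C", "C"],  # true "B": second-most votes, counted as correct
    ["A", "A", "B", "B", "C"],       # true "C": third, counted as an error
    ["mystery", "mystery", "A"],     # true "mystery": most votes, counted as correct
]
example_truth = ["A", "B", "C", "mystery"]

example_error = top2_error_rate(example_votes, example_truth)
print(f"top-2 error rate: {example_error:.2f}")
```

Under this criterion a classifier is only penalized when the true city falls outside its two leading candidates, so the reported error rates are lower than a strict top-1 error would be.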

Copyright © 2021 Zhang, Ellis, Walker and Datta.

Keywords: OTU; feature selection; microbiome; multilayer perceptron; random forest; support vector machine

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publication Types