PLoS One. 2013 Dec 02;8(12):e80503. doi: 10.1371/journal.pone.0080503. eCollection 2013.

Knowledge and theme discovery across very large biological data sets using distributed queries: a prototype combining unstructured and structured data.

Uma S Mudunuri, Mohamad Khouja, Stephen Repetski, Girish Venkataraman, Anney Che, Brian T Luke, F Pascal Girard, Robert M Stephens

Affiliations

  1. Advanced Biomedical Computing Center, Information Systems Program, SAIC-Frederick, Inc., Frederick National Laboratory for Cancer Research, Frederick, Maryland, United States of America.

PMID: 24312478 PMCID: PMC3846626 DOI: 10.1371/journal.pone.0080503

Abstract

As biomedical science continues to apply new technologies capable of producing unprecedented volumes of noisy and complex biological data, it has become evident that available methods for deriving meaningful information from such data are not keeping pace. To achieve useful results, researchers require methods that consolidate, store and query combinations of structured and unstructured data sets efficiently and effectively. As we move towards personalized medicine, the need to combine unstructured data, such as medical literature, with large amounts of highly structured, high-throughput data, such as human variation or expression data from very large cohorts, is especially urgent. For our study, we investigated a likely biomedical query using the Hadoop framework. We ran queries using native MapReduce tools we developed as well as other open source and proprietary tools. Our results suggest that the available technologies within the Big Data domain can reduce the time and effort needed to apply distributed queries over large data sets in practical clinical applications in the life sciences domain. The methodologies and technologies discussed in this paper set the stage for a more detailed evaluation of how various data structures and data models are best mapped to the proper computational framework.
