
Database (Oxford). 2016 May 09;2016. doi: 10.1093/database/baw068. Print 2016.

BioCreative V CDR task corpus: a resource for chemical disease relation extraction.

Database : the journal of biological databases and curation

Jiao Li, Yueping Sun, Robin J Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Thomas C Wiegers, Zhiyong Lu

Affiliations

  1. Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100020, China.
  2. Department of Biological Sciences and the Center for Human Health and the Environment, North Carolina State University, Raleigh, NC 27695, USA.
  3. National Center for Biotechnology Information, Bethesda, MD 20894, USA.
  4. National Center for Biotechnology Information, Bethesda, MD 20894, USA [email protected].

PMID: 27161011 PMCID: PMC4860626 DOI: 10.1093/database/baw068

Abstract

Community-run, formal evaluations and manually annotated text corpora are critically important for advancing biomedical text-mining research. Recently in BioCreative V, a new challenge was organized for the tasks of disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction. Given the nature of both tasks, a test collection is required to contain both disease/chemical annotations and relation annotations in the same set of articles. Despite previous efforts in biomedical corpus construction, none was found to be sufficient for the task. Thus, we developed our own corpus called BC5CDR during the challenge by inviting a team of Medical Subject Headings (MeSH) indexers for disease/chemical entity annotation and Comparative Toxicogenomics Database (CTD) curators for CID relation annotation. To ensure high annotation quality and productivity, detailed annotation guidelines and automatic annotation tools were provided. The resulting BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions. Each entity annotation includes both the mention text spans and normalized concept identifiers, using MeSH as the controlled vocabulary. To ensure accuracy, the entities were first captured independently by two annotators, followed by a consensus annotation. The average inter-annotator agreement (IAA) scores in the test set were 87.49% for diseases and 96.05% for chemicals, according to the Jaccard similarity coefficient. Our corpus was successfully used for the BioCreative V challenge tasks and should serve as a valuable resource for the text-mining research community.

Database URL: http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/.
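
As an illustration of the agreement measure named in the abstract, the Jaccard similarity coefficient of two annotators' annotation sets A and B is |A ∩ B| / |A ∪ B|. The sketch below is not the authors' evaluation code. It assumes each annotation is represented as a (start offset, end offset, MeSH identifier) tuple and that two annotations agree only when all three fields match exactly. The abstract does not state the matching criteria actually used, and the function name and sample annotations are hypothetical.

```python
from typing import Hashable, Iterable


def jaccard(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Return the Jaccard similarity |A ∩ B| / |A ∪ B| of two annotation sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: if neither annotator marked anything, count it as full agreement.
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical annotations for one abstract: (start offset, end offset, MeSH ID).
annotator_1 = {(0, 8, "D009270"), (45, 56, "D007022"), (80, 92, "D006973")}
annotator_2 = {(0, 8, "D009270"), (45, 56, "D007022"), (100, 110, "D003643")}

# Two annotations are shared and four are distinct overall, so the score is 2 / 4.
score = jaccard(annotator_1, annotator_2)
print(f"{score:.2%}")  # prints "50.00%"
```

An average IAA such as the one reported could then be obtained by averaging this score over all doubly annotated articles, although whether the authors averaged per article or pooled annotations is not stated in the abstract.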

Published by Oxford University Press 2016. This work is written by US Government employees and is in the public domain in the United States.

