Sci Data. 2020 Jun 26;7(1):205. doi: 10.1038/s41597-020-0543-2.

Building a PubMed knowledge graph.

Scientific data

Jian Xu, Sunkyu Kim, Min Song, Minbyul Jeong, Donghyeon Kim, Jaewoo Kang, Justin F Rousseau, Xin Li, Weijia Xu, Vetle I Torvik, Yi Bu, Chongyan Chen, Islam Akef Ebeid, Daifeng Li, Ying Ding

Affiliations

  1. School of Information Management, Sun Yat-sen University, Guangzhou, China.
  2. Department of Computer Science and Engineering, Korea University, Seoul, South Korea.
  3. Department of Library and Information Science, Yonsei University, Seoul, South Korea.
  4. Dell Medical School, University of Texas at Austin, Austin, TX, USA.
  5. School of Information, University of Texas at Austin, Austin, TX, USA.
  6. Texas Advanced Computing Center, Austin, TX, USA.
  7. School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, IL, USA.
  8. Department of Information Management, Peking University, Beijing, China.
  9. School of Information Management, Sun Yat-sen University, Guangzhou, China. [email protected].
  10. Dell Medical School, University of Texas at Austin, Austin, TX, USA. [email protected].
  11. School of Information, University of Texas at Austin, Austin, TX, USA. [email protected].

PMID: 32591513 PMCID: PMC7320186 DOI: 10.1038/s41597-020-0543-2

Abstract
