Biomolecules. 2018 Mar 14;8(1). doi: 10.3390/biom8010012.

The Impact of Protein Structure and Sequence Similarity on the Accuracy of Machine-Learning Scoring Functions for Binding Affinity Prediction.

Hongjian Li, Jiangjun Peng, Yee Leung, Kwong-Sak Leung, Man-Hon Wong, Gang Lu, Pedro J Ballester

Affiliations

  1. SDIVF R&D Centre, Hong Kong Science Park, Sha Tin, New Territories, Hong Kong, China.
  2. Institute of Future Cities, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong, China.
  3. Department of Computer Science and Engineering, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong, China.
  4. Institute of Future Cities, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong, China.
  5. School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an 710049, China.
  6. Institute of Future Cities, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong, China.
  7. Institute of Future Cities, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong, China.
  8. Department of Computer Science and Engineering, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong, China.
  9. Department of Computer Science and Engineering, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong, China.
  10. School of Biomedical Sciences, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong, China.
  11. Cancer Research Center of Marseille, INSERM U1068, F-13009 Marseille, France.
  12. Institut Paoli-Calmettes, F-13009 Marseille, France.
  13. Aix-Marseille Université, F-13284 Marseille, France.
  14. CNRS UMR7258, F-13009 Marseille, France.

PMID: 29538331 PMCID: PMC5871981 DOI: 10.3390/biom8010012

Abstract

It has recently been claimed that the outstanding performance of machine-learning scoring functions (SFs) is exclusively due to the presence of training complexes with proteins highly similar to those in the test set. Here, we revisit this question using 24 similarity-based training sets, a widely used test set, and four SFs. Three of these SFs employ machine learning, whereas the fourth, X-Score, uses the classical linear regression approach and has the best test set performance out of 16 classical SFs. We found that the random forest (RF)-based RF-Score-v3 outperforms X-Score even when 68% of the most similar proteins are removed from the training set. In addition, unlike X-Score, RF-Score-v3 keeps learning as the training set grows, becoming substantially more predictive than X-Score when all 1105 complexes are used for training. These results show that machine-learning SFs owe a substantial part of their performance to training on complexes whose proteins are dissimilar to those in the test set, contrary to what was previously concluded from the same data. Given that a growing amount of structural and interaction data will become available from academic and industrial sources, this performance gap between machine-learning SFs and classical SFs is expected to widen in the future.

Keywords: binding affinity prediction; machine learning; molecular docking; scoring function
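
As a rough illustration of how similarity-based training sets like those mentioned in the abstract could be assembled, the Python sketch below keeps, for each cutoff, only the training complexes whose highest similarity to any test-set protein does not exceed that cutoff. The function names, complex IDs, similarity values, and cutoffs are all hypothetical; the abstract does not state how similarity was measured or which cutoffs were used, so this is a sketch under those assumptions and not the authors' procedure.

```python
from typing import Dict, List, Sequence


def nested_training_sets(
    max_similarity: Dict[str, float],
    cutoffs: Sequence[float],
) -> Dict[float, List[str]]:
    """For each cutoff, keep the training complexes whose highest similarity
    to any test-set protein does not exceed that cutoff."""
    return {
        cutoff: sorted(
            cid for cid, sim in max_similarity.items() if sim <= cutoff
        )
        for cutoff in cutoffs
    }


def fraction_removed(kept: Sequence[str], total: int) -> float:
    """Share of the full training set that is excluded at a given cutoff."""
    return 1.0 - len(kept) / total


# Hypothetical toy data: complex ID -> highest similarity to any test protein.
toy_similarity = {
    "c1": 0.15,
    "c2": 0.35,
    "c3": 0.55,
    "c4": 0.80,
    "c5": 1.00,
}

training_sets = nested_training_sets(toy_similarity, cutoffs=[0.4, 0.6, 1.0])
for cutoff, ids in training_sets.items():
    removed = fraction_removed(ids, len(toy_similarity))
    print(f"cutoff={cutoff}: {len(ids)} complexes kept, {removed:.0%} removed")
```

Because the sets are nested, each larger cutoff only adds complexes. A scoring function retrained on each set can then be compared on the same test set to see how its accuracy changes as more similar proteins are admitted.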

Conflict of interest statement

The authors declare no conflict of interest.
