Methods Mol Biol. 2018;1825:369-410. doi: 10.1007/978-1-4939-8639-2_13.

Selection of Informative Examples in Chemogenomic Datasets.

Methods in molecular biology (Clifton, N.J.)

Daniel Reker, J B Brown

Affiliations

  1. Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA, USA.
  2. Life Science Informatics Research Unit, Laboratory of Molecular Biosciences, Kyoto University Graduate School of Medicine, Kyoto, Japan.

PMID: 30334214 DOI: 10.1007/978-1-4939-8639-2_13

Abstract

High-throughput and high-content screening campaigns have resulted in the creation of large chemogenomic matrices. These matrices form the training data which is used to build ligand-target interaction models for pharmacological and chemical biology research. While academic, government, and industrial efforts continuously add to the ligand-target data pairs available for modeling, major research efforts are devoted to improving machine learning techniques to cope with the sparseness, heterogeneity, and size of available datasets as well as inherent noise and bias. This "race of arms" has led to the creation of algorithms to generate highly complex models with high prediction performance at the cost of training efficiency as well as interpretability.

In contrast, recent studies have challenged the necessity for "big data" in chemogenomic modeling and found that models built on larger numbers of examples do not necessarily result in better predictive abilities. Automated adaptive selection of the training data (ligand-target instances) used for model creation can result in considerably smaller training sets that retain prediction performance on par with training using hundreds of thousands of data points. In this chapter, we describe the protocols used for one such iterative chemogenomic selection technique, including model construction and update as well as possible techniques for evaluations of constructed models and analysis of the iterative model construction.

Keywords: Active learning; Data mining; Model complexity; Sampling; Subset selection
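
The iterative selection idea summarized in the abstract can be illustrated with a generic pool-based active learning loop. The sketch below is not the chapter's protocol: the synthetic two-feature "ligand-target pairs", the k-nearest-neighbour stand-in model, the uncertainty criterion, and all function names (make_pairs, knn_vote, active_learning, accuracy) are assumptions chosen only to keep the example self-contained.

```python
import random

def make_pairs(n, seed=0):
    """Synthetic stand-in for ligand-target pairs: 2 features, binary label."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        y = 1 if x[0] + x[1] > 0 else 0  # hypothetical "active" rule
        data.append((x, y))
    return data

def knn_vote(train, x, k=3):
    """Fraction of the k nearest labelled examples that are active (label 1)."""
    def dist2(example):
        (a, b), _ = example
        return (a - x[0]) ** 2 + (b - x[1]) ** 2
    nearest = sorted(train, key=dist2)[:k]
    return sum(label for _, label in nearest) / len(nearest)

def active_learning(pool, n_start=4, n_rounds=30, seed=0):
    """Grow a training set one example at a time by uncertainty sampling."""
    rng = random.Random(seed)
    pool = list(pool)
    rng.shuffle(pool)
    train, pool = pool[:n_start], pool[n_start:]
    for _ in range(n_rounds):
        # Select the pool example whose predicted activity is closest to 0.5,
        # i.e. the one the current model is least certain about.
        idx = min(range(len(pool)),
                  key=lambda i: abs(knn_vote(train, pool[i][0]) - 0.5))
        # "Measure" it (here: reveal its stored label) and update the model.
        train.append(pool.pop(idx))
    return train, pool

def accuracy(train, test):
    """Share of test examples whose thresholded vote matches the label."""
    hits = sum((knn_vote(train, x) >= 0.5) == bool(y) for x, y in test)
    return hits / len(test)

pairs = make_pairs(200, seed=1)
selected, remaining = active_learning(pairs, n_start=4, n_rounds=30, seed=1)
print(len(selected), "selected;", "accuracy on the rest:",
      accuracy(selected, remaining))
```

Tracking the accuracy on the unselected examples after each round gives one simple way to analyse how the model evolves as the training set grows; a real study would substitute the actual model and evaluation scheme.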
