Display options
Share it on

Biology (Basel). 2013 Nov 08;2(4):1282-95. doi: 10.3390/biology2041282.

Algorithms for hidden markov models restricted to occurrences of regular expressions.

Biology

Paula Tataru, Andreas Sand, Asger Hobolth, Thomas Mailund, Christian N S Pedersen

Affiliations

  1. Bioinformatics Research Centre, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus C, Denmark. [email protected].
  2. Bioinformatics Research Centre, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus C, Denmark. [email protected].
  3. Bioinformatics Research Centre, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus C, Denmark. [email protected].
  4. Bioinformatics Research Centre, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus C, Denmark. [email protected].
  5. Bioinformatics Research Centre, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus C, Denmark. [email protected].

PMID: 24833225 PMCID: PMC4009796 DOI: 10.3390/biology2041282

Abstract

Hidden Markov Models (HMMs) are widely used probabilistic models, particularly for annotating sequential data with an underlying hidden structure. Patterns in the annotation are often more relevant to study than the hidden structure itself. A typical HMM analysis consists of annotating the observed data using a decoding algorithm and analyzing the annotation to study patterns of interest. For example, given an HMM modeling genes in DNA sequences, the focus is on occurrences of genes in the annotation. In this paper, we define a pattern through a regular expression and present a restriction of three classical algorithms to take the number of occurrences of the pattern in the hidden sequence into account. We present a new algorithm to compute the distribution of the number of pattern occurrences, and we extend the two most widely used existing decoding algorithms to employ information from this distribution. We show experimentally that the expectation of the distribution of the number of pattern occurrences gives a highly accurate estimate, while the typical procedure can be biased in the sense that the identified number of pattern occurrences does not correspond to the true number. We furthermore show that using this distribution in the decoding algorithms improves the predictive power of the model.

References

  1. Bioinformatics. 2007 Jul 1;23(13):i289-96 - PubMed
  2. BMC Bioinformatics. 2005 Dec 01;6 Suppl 4:S12 - PubMed
  3. J Math Biol. 2008 Jan;56(1-2):51-92 - PubMed
  4. Nucleic Acids Res. 1994 Nov 11;22(22):4768-78 - PubMed
  5. Genomics. 1996 Jun 15;34(3):353-67 - PubMed
  6. J Mol Biol. 1994 Feb 4;235(5):1501-31 - PubMed
  7. Nucleic Acids Res. 1998 Feb 15;26(4):1107-15 - PubMed
  8. Bioinformatics. 1998;14(9):755-63 - PubMed
  9. J Mol Biol. 2001 Jan 19;305(3):567-80 - PubMed
  10. Proteins. 1999;Suppl 3:121-5 - PubMed
  11. PLoS Genet. 2011 Mar;7(3):e1001319 - PubMed
  12. J Bioinform Comput Biol. 2010 Jun;8(3):535-51 - PubMed

Publication Types