Online J Bioinform. 2007 Jan 01;8(1):30-40.

PIDA: A new algorithm for pattern identification.

Online journal of bioinformatics : OJB

C Putonti, BM Pettitt, JG Reid, Y Fofanov

Affiliations

  1. Department of Computer Science, University of Houston, Houston, Texas, USA.

PMID: 19834570 PMCID: PMC2761635

Abstract

Algorithms for motif identification in sequence space have predominantly focused on recognizing patterns of a fixed length that contain regions of perfect conservation along with possible regions of unconstrained sequence. Such motifs occur in everything from proteins with distinct active sites to non-coding RNAs with specific structural elements that are necessary to maintain functionality. If an insertion or deletion occurs within an unconstrained portion of a pattern, the pattern may retain its functionality. In that case the length of the pattern becomes variable, and the pattern may be overlooked by existing motif detection methods. The Pattern Island Detection Algorithm (PIDA) presented here has been developed to recognize patterns whose occurrences vary in length, within sequences over an alphabet of any size. PIDA first identifies all regions of perfect conservation (longer than a user-specified threshold) and then builds these conserved "islands" into fixed-length patterns. Next, the algorithm modifies these fixed-length patterns by identifying additional (and different) islands that can be incorporated into each pattern through insertions/deletions within the "water" separating the islands. To provide benchmarks for this analysis, PIDA was used to search for patterns within randomly generated sequences as well as sequences known to contain conserved patterns. For each pattern found, the statistical significance is calculated from the pattern's likelihood of appearing by chance, providing a means to identify the patterns that are likely to have a functional role. The PIDA approach to motif finding is designed to perform best when searching for patterns of variable length, although it can also identify patterns of a fixed length. PIDA has been designed to be as generally applicable as possible, since a variety of sequence problems are of this type.
The algorithm was implemented in C++ and is freely available upon request from the authors.
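
The island-and-water idea described in the abstract can be illustrated with a small Python sketch. This is not the authors' C++ implementation: it assumes a single fixed island length `k` rather than all lengths above a threshold, and the names `find_islands`, `water_lengths`, and `min_support` are hypothetical, introduced only for this example. It shows how two perfectly conserved islands can be separated by water of different lengths in different sequences.

```python
from collections import defaultdict


def find_islands(sequences, k, min_support=2):
    """Map each k-mer found in at least `min_support` sequences to its
    start positions, keyed by sequence index (a toy notion of 'island')."""
    occurrences = defaultdict(lambda: defaultdict(list))
    for idx, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            occurrences[seq[start:start + k]][idx].append(start)
    return {
        kmer: dict(by_seq)
        for kmer, by_seq in occurrences.items()
        if len(by_seq) >= min_support
    }


def water_lengths(islands, left, right, k):
    """For every sequence containing both islands, return the shortest
    gap (the 'water') between `left` and a later occurrence of `right`."""
    gaps = {}
    for idx, left_starts in islands[left].items():
        right_starts = islands[right].get(idx, [])
        found = [r - (l + k) for l in left_starts
                 for r in right_starts if r >= l + k]
        if found:
            gaps[idx] = min(found)
    return gaps


# Two toy sequences sharing the islands ACGT and CATG, separated by
# water of length 2 ("AA") in the first and 3 ("TCT") in the second.
sequences = ["TTACGTAACATGAA", "CCACGTTCTCATGC"]
islands = find_islands(sequences, k=4)
gaps = water_lengths(islands, "ACGT", "CATG", k=4)

print(sorted(islands))  # ['ACGT', 'CATG']
print(gaps)             # {0: 2, 1: 3}
```

Because the water differs in length between the two sequences, a search restricted to fixed-length patterns would treat these as two unrelated occurrences, which is the situation the abstract describes.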
