Biotechnol Biotechnol Equip. 2014 Sep 03;28(5):958-967. doi: 10.1080/13102818.2014.959711. Epub 2014 Oct 31.

Towards computational improvement of DNA database indexing and short DNA query searching.

Biotechnology, biotechnological equipment

Done Stojanov, Sašo Koceski, Aleksandra Mileva, Nataša Koceska, Cveta Martinovska Bande

Affiliations

  1. Faculty of Computer Science, Department of Computer Technologies and Intelligent Systems, University "Goce Delčev", Štip, Republic of Macedonia.

PMID: 26019584 PMCID: PMC4434100 DOI: 10.1080/13102818.2014.959711

Abstract

To facilitate and speed up searches of massive DNA databases, the database is first indexed using a mapping function. Exact query hits can then be identified by searching the indexed data structure. If the database is searched against an annotated DNA query, such as a known promoter consensus sequence, the starting locations and the number of potential genes can be determined. This is particularly relevant when unannotated DNA sequences have to be functionally annotated. However, indexing a massive DNA database and searching an indexed data structure with millions of entries are time-demanding processes. In this paper, we propose a fast DNA database indexing and searching approach that identifies all query hits in the database without having to examine all entries in the indexed data structure, while limiting the maximum length of a query that can be searched against the database. By applying the proposed indexing equation, the whole human genome could be indexed in 10 hours on a personal computer, under the assumption that there is enough RAM to store the indexed data structure. Analysing the methodology proposed by Reneker, we observed that hits at starting positions [Formula: see text] are not reported if the database is searched against a query shorter than [Formula: see text] nucleotides, where [Formula: see text] is the length of the DNA database words being mapped and [Formula: see text] is the length of the query. A solution to this drawback is also presented.
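The general idea of indexing with a mapping function and then searching only part of the index can be illustrated with a minimal Python sketch. It is not the indexing equation proposed in the paper, which is not given in the abstract. It assumes a plain base-4 encoding of fixed-length words, and all names (`word_to_int`, `build_index`, `search`) are hypothetical. Under that assumption, a query no longer than the word length matches a contiguous range of mapped values, so only those entries are examined. The final loop shows one plausible way hits near the end of the database can be missed for short queries, namely that no full-length word starts there; whether this corresponds to the condition stated in the abstract is also an assumption.

```python
from collections import defaultdict

# Assumed 2-bit encoding; other characters (e.g. N) raise KeyError.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def word_to_int(word):
    """Map a DNA word to an integer by reading it as a base-4 number."""
    value = 0
    for base in word:
        value = value * 4 + CODE[base]
    return value


def build_index(db, w):
    """Map each length-w word of db to the list of its 0-based start positions."""
    index = defaultdict(list)
    for i in range(len(db) - w + 1):
        index[word_to_int(db[i:i + w])].append(i)
    return index


def search(db, index, w, query):
    """Return sorted 0-based start positions of all exact hits of query."""
    q = len(query)
    if q == 0 or q > w:
        raise ValueError("query length must be between 1 and the word length")
    # All length-w words that begin with the query map into one
    # contiguous integer range, so only those keys are looked up.
    lo = word_to_int(query) * 4 ** (w - q)
    hi = lo + 4 ** (w - q)
    hits = []
    for key in range(lo, hi):
        hits.extend(index.get(key, ()))
    # Start positions past len(db) - w have no full length-w word and
    # are therefore absent from the index; check them directly.
    for i in range(max(len(db) - w + 1, 0), len(db) - q + 1):
        if db[i:i + q] == query:
            hits.append(i)
    return sorted(hits)


db = "ACGTACGTAC"
index = build_index(db, 4)
print(search(db, index, 4, "AC"))    # [0, 4, 8]
print(search(db, index, 4, "ACGT"))  # [0, 4]
```

In this toy database the hit for "AC" at position 8 is found only by the final loop, because no four-letter word starts at that position. Looking up each key in the range is practical only when the query is close to the word length; for much shorter queries the number of keys grows as a power of four.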

Keywords: DNA database; E. coli; all hits; fast indexing and search

Publication Types