PeerJ. 2017 Mar 16;5:e3091. doi: 10.7717/peerj.3091. eCollection 2017.

Effect of method of deduplication on estimation of differential gene expression using RNA-seq.

Anna V Klepikova, Artem S Kasianov, Mikhail S Chesnokov, Natalia L Lazarevich, Aleksey A Penin, Maria Logacheva

Affiliations

  1. Institute for Information Transmission Problems of the Russian Academy of Sciences, Moscow, Russia.
  2. A. N. Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow, Russia.
  3. N. I. Vavilov Institute for General Genetics, Moscow, Russia.
  4. N.N. Blokhin Russian Cancer Research Center of the Ministry of Health of the Russian Federation, Moscow, Russia.
  5. Department of Biology, Lomonosov Moscow State University, Moscow, Russia.
  6. Extreme Biology Laboratory, Institute of Fundamental Medicine and Biology, Kazan Federal University, Kazan.

PMID: 28321364 PMCID: PMC5357343 DOI: 10.7717/peerj.3091

Abstract

BACKGROUND: RNA-seq is a useful tool for the analysis of gene expression. However, its robustness is strongly affected by a number of artifacts, one of which is the presence of duplicated reads.

RESULTS: To assess how different methods of removing duplicated reads influence the estimation of gene expression in cancer genomics, we analyzed paired samples of hepatocellular carcinoma (HCC) and non-tumor liver tissue. Four data analysis protocols were applied to each sample: processing without deduplication, deduplication using the method implemented in SAMtools, and deduplication based on one or two molecular indices (MI). We also analyzed the influence of sequencing layout (single read or paired end) and read length. We found that deduplication without MI greatly affects estimated expression values; this effect is most pronounced for highly expressed genes.

CONCLUSION: The use of unique molecular identifiers greatly improves the accuracy of RNA-seq analysis, especially for highly expressed genes. We developed a set of scripts that enable the handling of MI and their incorporation into RNA-seq analysis pipelines. Deduplication without MI affects the results of differential gene expression analysis, producing a high proportion of false negative results, whereas omitting duplicate read removal biases the results towards false positives. In cases where using MI is not possible, we recommend using a paired-end sequencing layout.
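To make the difference between the compared protocols concrete, the following is a minimal sketch, not the authors' scripts. It assumes a simplified, hypothetical `Read` record (chromosome, start position, strand, MI sequence) and toy data; real tools work on alignment files and apply further rules. It only illustrates why coordinate-only deduplication can undercount molecules of a highly expressed gene, while no deduplication overcounts them.

```python
from typing import Iterable, NamedTuple


class Read(NamedTuple):
    """Hypothetical, simplified aligned read (an assumption for illustration)."""
    chrom: str
    pos: int      # alignment start coordinate
    strand: str
    mi: str       # molecular index attached before amplification


def count_without_dedup(reads: Iterable[Read]) -> int:
    # Every read is counted, so PCR duplicates inflate the estimate.
    return len(list(reads))


def count_position_dedup(reads: Iterable[Read]) -> int:
    # Coordinate-only deduplication: reads sharing chromosome, position and
    # strand collapse into one, even if they come from different molecules.
    return len({(r.chrom, r.pos, r.strand) for r in reads})


def count_mi_dedup(reads: Iterable[Read]) -> int:
    # MI-aware deduplication: reads collapse only when the coordinates AND
    # the molecular index are identical.
    return len({(r.chrom, r.pos, r.strand, r.mi) for r in reads})


# Toy data: three distinct molecules, one of which was amplified twice.
reads = [
    Read("chr1", 100, "+", "ACGT"),
    Read("chr1", 100, "+", "ACGT"),  # PCR duplicate of the read above
    Read("chr1", 100, "+", "TTAG"),  # different molecule, same coordinates
    Read("chr1", 250, "+", "GGCA"),
]

print(count_without_dedup(reads))   # 4
print(count_position_dedup(reads))  # 2
print(count_mi_dedup(reads))        # 3
```

In this toy case only the MI-aware count matches the three molecules that were actually present; the gap between the other two counts and the true value grows as more independent molecules share the same coordinates, which is the situation expected for highly expressed genes.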

Keywords: Cancer genomics; Deduplication; Differential expression; Hepatocarcinoma; RNA-seq

Conflict of interest statement

The authors declare there are no competing interests.

