TY - JOUR
T1 - Analyzing taxonomic classification using extensible Markov models
AU - Kotamarti, Rao M.
AU - Hahsler, Michael
AU - Raiford, Douglas
AU - McGee, Monnie
AU - Dunham, Margaret H.
AU - Rost, Burkhard
N1 - Publisher Copyright: © The Author(s) 2010.
PY - 2011
Y1 - 2011
N2 - Motivation: As next generation sequencing is rapidly adding new genomes, their correct placement in the taxonomy needs verification. However, the current methods for confirming classification of a taxon or suggesting revision for a potential misplacement rely on computationally intensive multi-sequence alignment followed by an iterative adjustment of the distance matrix. Due to intra-heterogeneity issues with the 16S rRNA marker, no classifier is available at the sub-genus level that could readily suggest a classification for a novel 16S rRNA sequence. Metagenomics further complicates the issue by generating fragmented 16S rRNA sequences. This article proposes a novel alignment-free method for representing microbial profiles using extensible Markov models (EMMs) with an extended Karlin-Altschul statistical framework similar to the classic alignment paradigm. We propose a log odds (LODs) score classifier based on the Gumbel difference distribution that confirms correct classifications with statistical significance qualifications and suggests revisions where necessary. Results: We tested our method by generating a sub-genus level classifier with which we re-evaluated classifications of 676 microbial organisms using the NCBI FTP database for the 16S rRNA. The results confirm current classification for all genera while ascertaining significance at 95%. Furthermore, this novel classifier isolates heterogeneity issues to a mere 12 strains while confirming classifications with significance qualification for the remaining 98%. The models require less memory than that needed by multi-sequence alignments and have better time complexity than the current methods. The classifier operates at sub-genus level, and thus outperforms the naive Bayes classifier of the Ribosomal Database Project where much of the taxonomic analysis is available online. Finally, using information redundancy in model building, we show that the method applies to metagenomic fragment classification of 19 Escherichia coli strains.
UR - http://www.scopus.com/inward/record.url?scp=84983191295&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btq349
DO - 10.1093/bioinformatics/btq349
M3 - Conference article
AN - SCOPUS:84983191295
SN - 1367-4803
VL - 27
SP - 2235
EP - 2241
JO - Bioinformatics
JF - Bioinformatics
IS - 13
T2 - 19th Annual International Conference on Intelligent Systems for Molecular Biology, Joint with the 10th European Conference on Computational Biology, ISMB/ECCB 2011
Y2 - 17 July 2011 through 19 July 2011
ER -