Skip to main navigation Skip to search Skip to main content

Recognition of spontaneous conversational speech using long Short-Term Memory phoneme predictions

  • Technical University of Munich

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

16 Scopus citations

Abstract

We present a novel continuous speech recognition framework designed to unite the principles of triphone and Long Short-Term Memory (LSTM) modeling. The LSTM principle allows a recurrent neural network to store and to retrieve information over long time periods, which was shown to be well-suited for the modeling of co-articulation effects in human speech. Our system uses a bidirectional LSTM network to generate a phoneme prediction feature that is observed by a triphone-based large-vocabulary continuous speech recognition (LVCSR) decoder, together with conventional MFCC features. We evaluate both, phoneme prediction error rates of various network architectures and the word recognition performance of our Tandem approach using the COSINE database - a large corpus of conversational and noisy speech, and show that incorporating LSTM phoneme predictions in to an LVCSR system leads to significantly higher word accuracies.

Original languageEnglish
Title of host publicationProceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010
PublisherInternational Speech Communication Association
Pages1946-1949
Number of pages4
StatePublished - 2010

Publication series

NameProceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010

Keywords

  • Context modeling
  • Large-vocabulary continuous speech recognition
  • Long Short-Term Memory
  • Recurrent neural networks

Fingerprint

Dive into the research topics of 'Recognition of spontaneous conversational speech using long Short-Term Memory phoneme predictions'. Together they form a unique fingerprint.

Cite this