TY - GEN
T1 - Recognition of spontaneous conversational speech using long Short-Term Memory phoneme predictions
AU - Wollmer, Martin
AU - Eyben, Florian
AU - Schuller, Björn
AU - Rigoll, Gerhard
PY - 2010
Y1 - 2010
N2 - We present a novel continuous speech recognition framework designed to unite the principles of triphone and Long Short-Term Memory (LSTM) modeling. The LSTM principle allows a recurrent neural network to store and to retrieve information over long time periods, which was shown to be well-suited for the modeling of co-articulation effects in human speech. Our system uses a bidirectional LSTM network to generate a phoneme prediction feature that is observed by a triphone-based large-vocabulary continuous speech recognition (LVCSR) decoder, together with conventional MFCC features. We evaluate both, phoneme prediction error rates of various network architectures and the word recognition performance of our Tandem approach using the COSINE database - a large corpus of conversational and noisy speech, and show that incorporating LSTM phoneme predictions in to an LVCSR system leads to significantly higher word accuracies.
AB - We present a novel continuous speech recognition framework designed to unite the principles of triphone and Long Short-Term Memory (LSTM) modeling. The LSTM principle allows a recurrent neural network to store and to retrieve information over long time periods, which was shown to be well-suited for the modeling of co-articulation effects in human speech. Our system uses a bidirectional LSTM network to generate a phoneme prediction feature that is observed by a triphone-based large-vocabulary continuous speech recognition (LVCSR) decoder, together with conventional MFCC features. We evaluate both, phoneme prediction error rates of various network architectures and the word recognition performance of our Tandem approach using the COSINE database - a large corpus of conversational and noisy speech, and show that incorporating LSTM phoneme predictions in to an LVCSR system leads to significantly higher word accuracies.
KW - Context modeling
KW - Large-vocabulary continuous speech recognition
KW - Long Short-Term Memory
KW - Recurrent neural networks
UR - https://www.scopus.com/pages/publications/79959821052
M3 - Conference contribution
AN - SCOPUS:79959821052
T3 - Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010
SP - 1946
EP - 1949
BT - Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010
PB - International Speech Communication Association
ER -