TY - JOUR
T1 - Memory-enhanced neural networks and NMF for robust ASR
AU - Geiger, Jürgen T.
AU - Weninger, Felix
AU - Gemmeke, Jort F.
AU - Wöllmer, Martin
AU - Schuller, Björn
AU - Rigoll, Gerhard
N1 - Publisher Copyright: © 2014 IEEE.
PY - 2014/6/1
Y1 - 2014/6/1
N2 - In this article we address the problem of distant speech recognition for reverberant noisy environments. Speech enhancement methods, e.g., using non-negative matrix factorization (NMF), are successful in improving the robustness of ASR systems. Furthermore, discriminative training and feature transformations are employed to increase the robustness of traditional systems using Gaussian mixture models (GMM). On the other hand, acoustic models based on deep neural networks (DNN) were recently shown to outperform GMMs. In this work, we combine a state-of-the-art GMM system with a deep Long Short-Term Memory (LSTM) recurrent neural network in a double-stream architecture. Such networks use memory cells in the hidden units, enabling them to learn long-range temporal context, and thus increasing the robustness against noise and reverberation. The network is trained to predict frame-wise phoneme estimates, which are converted into observation likelihoods to be used as an acoustic model. It is of particular interest whether the LSTM system is capable of improving a robust state-of-the-art GMM system, which is confirmed in the experimental results. In addition, we investigate the efficiency of NMF for speech enhancement on the front-end side. Experiments are conducted on the medium-vocabulary task of the 2nd 'CHiME' Speech Separation and Recognition Challenge, which includes reverberation and highly variable noise. Experimental results show that the average word error rate of the challenge baseline is reduced by 64% relative. The best challenge entry, a noise-robust state-of-the-art recognition system, is outperformed by 25% relative.
AB - In this article we address the problem of distant speech recognition for reverberant noisy environments. Speech enhancement methods, e.g., using non-negative matrix factorization (NMF), are successful in improving the robustness of ASR systems. Furthermore, discriminative training and feature transformations are employed to increase the robustness of traditional systems using Gaussian mixture models (GMM). On the other hand, acoustic models based on deep neural networks (DNN) were recently shown to outperform GMMs. In this work, we combine a state-of-the-art GMM system with a deep Long Short-Term Memory (LSTM) recurrent neural network in a double-stream architecture. Such networks use memory cells in the hidden units, enabling them to learn long-range temporal context, and thus increasing the robustness against noise and reverberation. The network is trained to predict frame-wise phoneme estimates, which are converted into observation likelihoods to be used as an acoustic model. It is of particular interest whether the LSTM system is capable of improving a robust state-of-the-art GMM system, which is confirmed in the experimental results. In addition, we investigate the efficiency of NMF for speech enhancement on the front-end side. Experiments are conducted on the medium-vocabulary task of the 2nd 'CHiME' Speech Separation and Recognition Challenge, which includes reverberation and highly variable noise. Experimental results show that the average word error rate of the challenge baseline is reduced by 64% relative. The best challenge entry, a noise-robust state-of-the-art recognition system, is outperformed by 25% relative.
KW - Long short-term memory
KW - Multi-stream recognition
KW - Noise robust speech recognition
KW - Non-negative matrix factorization
UR - http://www.scopus.com/inward/record.url?scp=84910095643&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2014.2318514
DO - 10.1109/TASLP.2014.2318514
M3 - Article
AN - SCOPUS:84910095643
SN - 1558-7916
VL - 22
SP - 1037
EP - 1046
JO - IEEE Transactions on Audio, Speech and Language Processing
JF - IEEE Transactions on Audio, Speech and Language Processing
IS - 6
ER -