TY - JOUR
T1 - Noise robust ASR in reverberated multisource environments applying convolutive NMF and Long Short-Term Memory
AU - Wöllmer, Martin
AU - Weninger, Felix
AU - Geiger, Jürgen
AU - Schuller, Björn
AU - Rigoll, Gerhard
N1 - Publisher Copyright: © 2012 Elsevier Ltd.
PY - 2013/5/1
Y1 - 2013/5/1
N2 - This article proposes and evaluates various methods to integrate the concept of bidirectional Long Short-Term Memory (BLSTM) temporal context modeling into a system for automatic speech recognition (ASR) in noisy and reverberated environments. Building on recent advances in Long Short-Term Memory architectures for ASR, we design a novel front-end for context-sensitive Tandem feature extraction and show how the Connectionist Temporal Classification approach can be used as a BLSTM-based back-end, as an alternative to Hidden Markov Models (HMMs). We combine context-sensitive BLSTM-based feature generation and speech decoding techniques with source separation by convolutive non-negative matrix factorization (NMF). Applying our speaker-adapted multi-stream HMM framework, which processes MFCC features from NMF-enhanced speech as well as word predictions obtained via BLSTM networks and non-negative sparse classification (NSC), we obtain an average accuracy of 91.86% on the PASCAL CHiME Challenge task at signal-to-noise ratios ranging from -6 to 9 dB. To our knowledge, this is the best result ever reported for the CHiME Challenge task.
KW - Automatic speech recognition
KW - Long Short-Term Memory
KW - Non-negative matrix factorization
KW - Tandem feature extraction
UR - http://www.scopus.com/inward/record.url?scp=84883396653&partnerID=8YFLogxK
U2 - 10.1016/j.csl.2012.05.002
DO - 10.1016/j.csl.2012.05.002
M3 - Article
AN - SCOPUS:84883396653
SN - 0885-2308
VL - 27
SP - 780
EP - 797
JO - Computer Speech and Language
JF - Computer Speech and Language
IS - 3
M1 - 532
ER -