TY - GEN
T1 - Feature enhancement by bidirectional LSTM networks for conversational speech recognition in highly non-stationary noise
AU - Wöllmer, Martin
AU - Zhang, Zixing
AU - Weninger, Felix
AU - Schuller, Björn
AU - Rigoll, Gerhard
PY - 2013/10/18
Y1 - 2013/10/18
N2 - The recognition of spontaneous speech in highly variable noise is known to be a challenge, especially at low signal-to-noise ratios (SNR). In this paper, we investigate the effect of applying bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks for speech feature enhancement in noisy conditions. BLSTM networks tend to prevail over conventional neural network architectures whenever the recognition or regression task relies on an intelligent exploitation of temporal context information. We show that BLSTM networks are well-suited for mapping from noisy to clean speech features and that the obtained recognition performance gain is partly complementary to improvements via additional techniques such as speech enhancement by non-negative matrix factorization and probabilistic feature generation by Bottleneck-BLSTM networks. Compared to simple multi-condition training or feature enhancement via standard recurrent neural networks, our BLSTM-based feature enhancement approach leads to remarkable gains in word accuracy in a highly challenging task of recognizing spontaneous speech at SNR levels between -6 and 9 dB.
KW - Long Short-Term Memory
KW - feature enhancement
KW - non-negative matrix factorization
KW - recurrent neural networks
UR - http://www.scopus.com/inward/record.url?scp=84890489927&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2013.6638983
DO - 10.1109/ICASSP.2013.6638983
M3 - Conference contribution
AN - SCOPUS:84890489927
SN - 9781479903566
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 6822
EP - 6826
BT - 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013 - Proceedings
T2 - 2013 38th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013
Y2 - 26 May 2013 through 31 May 2013
ER -