TY - GEN
T1 - Robust discriminative keyword spotting for emotionally colored spontaneous speech using bidirectional LSTM networks
AU - Wöllmer, Martin
AU - Eyben, Florian
AU - Keshet, Joseph
AU - Graves, Alex
AU - Schuller, Björn
AU - Rigoll, Gerhard
PY - 2009
Y1 - 2009
N2 - In this paper we propose a new technique for robust keyword spotting that uses bidirectional Long Short-Term Memory (BLSTM) recurrent neural nets to incorporate contextual information in speech decoding. Our approach overcomes the drawbacks of generative HMM modeling by applying a discriminative learning procedure that non-linearly maps speech features into an abstract vector space. By incorporating the outputs of a BLSTM network into the speech features, it is able to make use of past and future context for phoneme predictions. The robustness of the approach is evaluated on a keyword spotting task using the HUMAINE Sensitive Artificial Listener (SAL) database, which contains accented, spontaneous, and emotionally colored speech. The test is particularly stringent because the system is not trained on the SAL database, but only on the TIMIT corpus of read speech. We show that our method prevails over a discriminative keyword spotter without BLSTM-enhanced feature functions, which in turn has been proven to outperform HMM-based techniques.
AB - In this paper we propose a new technique for robust keyword spotting that uses bidirectional Long Short-Term Memory (BLSTM) recurrent neural nets to incorporate contextual information in speech decoding. Our approach overcomes the drawbacks of generative HMM modeling by applying a discriminative learning procedure that non-linearly maps speech features into an abstract vector space. By incorporating the outputs of a BLSTM network into the speech features, it is able to make use of past and future context for phoneme predictions. The robustness of the approach is evaluated on a keyword spotting task using the HUMAINE Sensitive Artificial Listener (SAL) database, which contains accented, spontaneous, and emotionally colored speech. The test is particularly stringent because the system is not trained on the SAL database, but only on the TIMIT corpus of read speech. We show that our method prevails over a discriminative keyword spotter without BLSTM-enhanced feature functions, which in turn has been proven to outperform HMM-based techniques.
KW - Recurrent neural networks
KW - Robustness
KW - Speech recognition
UR - http://www.scopus.com/inward/record.url?scp=70349203870&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2009.4960492
DO - 10.1109/ICASSP.2009.4960492
M3 - Conference contribution
AN - SCOPUS:70349203870
SN - 9781424423545
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 3949
EP - 3952
BT - 2009 IEEE International Conference on Acoustics, Speech, and Signal Processing - Proceedings, ICASSP 2009
T2 - 2009 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2009
Y2 - 19 April 2009 through 24 April 2009
ER -