TY - JOUR
T1 - Attention-enhanced connectionist temporal classification for discrete speech emotion recognition
AU - Zhao, Ziping
AU - Bao, Zhongtian
AU - Zhang, Zixing
AU - Cummins, Nicholas
AU - Wang, Haishuai
AU - Schuller, Björn
N1 - Publisher Copyright:
Copyright © 2019 ISCA
PY - 2019
Y1 - 2019
N2 - Discrete speech emotion recognition (SER), the assignment of a single emotion label to an entire speech utterance, is typically performed as a sequence-to-label task. This approach, however, is limited, in that it can result in models that do not capture temporal changes in the speech signal, including those indicative of a particular emotion. One potential solution to overcome this limitation is to model SER as a sequence-to-sequence task instead. In this regard, we have developed an attention-based bidirectional long short-term memory (BLSTM) neural network in combination with a connectionist temporal classification (CTC) objective function (Attention-BLSTM-CTC) for SER. We also assessed the benefits of incorporating two contemporary attention mechanisms, namely component attention and quantum attention, into the CTC framework. To the best of the authors' knowledge, this is the first time that such a hybrid architecture has been employed for SER. We demonstrated the effectiveness of our approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and FAU-Aibo Emotion corpora. The experimental results demonstrate that our proposed model outperforms current state-of-the-art approaches.
KW - Attention mechanism
KW - Bidirectional LSTM
KW - Connectionist temporal classification
KW - Speech emotion recognition
UR - http://www.scopus.com/inward/record.url?scp=85074727423&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2019-1649
DO - 10.21437/Interspeech.2019-1649
M3 - Conference article
AN - SCOPUS:85074727423
SN - 2308-457X
VL - 2019-September
SP - 206
EP - 210
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019
Y2 - 15 September 2019 through 19 September 2019
ER -