TY - GEN
T1 - Attention-augmented End-to-end Multi-task Learning for Emotion Prediction from Speech
AU - Zhang, Zixing
AU - Wu, Bingwen
AU - Schuller, Björn
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/5
Y1 - 2019/5
AB - Despite the increasing research interest in end-to-end learning systems for speech emotion recognition, conventional systems either suffer from overfitting, due in part to limited training data, or do not explicitly consider the different contributions of automatically learnt representations to a specific task. In this contribution, we propose a novel end-to-end framework that is enhanced by learning auxiliary tasks and by an attention mechanism. That is, we jointly train an end-to-end network on several different but related emotion prediction tasks, i.e., arousal, valence, and dominance prediction, to extract representations shared across tasks that are more robust than those of traditional systems, with the hope of relieving the overfitting problem. Meanwhile, an attention layer is implemented on top of the layers for each task, with the aim of capturing how different segments contribute to each individual task. To evaluate the effectiveness of the proposed system, we conducted a set of experiments on the widely used IEMOCAP database. The empirical results show that the proposed systems significantly outperform the corresponding baseline systems.
KW - Speech emotion prediction
KW - attention mechanism
KW - end-to-end
KW - multi-task learning
UR - http://www.scopus.com/inward/record.url?scp=85068130177&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2019.8682896
DO - 10.1109/ICASSP.2019.8682896
M3 - Conference contribution
AN - SCOPUS:85068130177
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 6705
EP - 6709
BT - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 44th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019
Y2 - 12 May 2019 through 17 May 2019
ER -