TY - GEN
T1 - Hierarchical Component-attention Based Speaker Turn Embedding for Emotion Recognition
AU - Liu, Shuo
AU - Jiao, Jinlong
AU - Zhao, Ziping
AU - Dineley, Judith
AU - Cummins, Nicholas
AU - Schuller, Björn
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/7
Y1 - 2020/7
AB - Traditional discrete-time Speech Emotion Recognition (SER) modelling techniques typically assume that an entire speaker chunk or turn is indicative of its corresponding label. An alternative approach is to assume that emotional saliency varies over the course of a speaker turn and to use modelling techniques capable of identifying and utilising the most emotionally salient segments, such as those with higher emotional intensity. This strategy has the potential to improve the accuracy of SER systems. Towards this goal, we developed a novel hierarchical recurrent neural network model that produces turn-level embeddings for SER. Specifically, we apply two levels of attention to learn to identify the salient emotional words in a turn as well as the more informative frames within these words. In a set of experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, we demonstrate that component-attention is more effective within our hierarchical framework than both standard soft-attention and conventional local-attention. Our best network, a hierarchical component-attention network with an attention scope of seven, achieved an Unweighted Average Recall (UAR) of 65.0% and a Weighted Average Recall (WAR) of 66.1%, outperforming the other baseline attention approaches.
KW - Component-attention
KW - Hierarchical attention network
KW - Speech emotion recognition
KW - Turn embedding
UR - http://www.scopus.com/inward/record.url?scp=85093851777&partnerID=8YFLogxK
DO - 10.1109/IJCNN48605.2020.9207374
M3 - Conference contribution
AN - SCOPUS:85093851777
T3 - Proceedings of the International Joint Conference on Neural Networks
BT - 2020 International Joint Conference on Neural Networks, IJCNN 2020 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2020 International Joint Conference on Neural Networks, IJCNN 2020
Y2 - 19 July 2020 through 24 July 2020
ER -