Hierarchical Component-attention Based Speaker Turn Embedding for Emotion Recognition

Shuo Liu, Jinlong Jiao, Ziping Zhao, Judith Dineley, Nicholas Cummins, Bjorn Schuller

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

5 Scopus citations

Abstract

Traditional discrete-time Speech Emotion Recognition (SER) modelling techniques typically assume that an entire speaker chunk or turn is indicative of its corresponding label. An alternative approach is to assume emotional saliency varies over the course of a speaker turn and use modelling techniques capable of identifying and utilising the most emotionally salient segments, such as those with higher emotional intensity. This strategy has the potential to improve the accuracy of SER systems. Towards this goal, we developed a novel hierarchical recurrent neural network model that produces turn level embeddings for SER. Specifically, we apply two levels of attention to learn to identify salient emotional words in a turn as well as the more informative frames within these words. In a set of experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, we demonstrate that component-attention is more effective within our hierarchical framework than both standard soft-attention and conventional local-attention. Our best network, a hierarchical component-attention network with an attention scope of seven, achieved an Unweighted Average Recall (UAR) of 65.0 % and a Weighted Average Recall (WAR) of 66.1 %, outperforming other baseline attention approaches on the IEMOCAP database.
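
The two metrics reported in the abstract weight emotion classes differently. UAR is the mean of the per-class recalls, so every class counts equally regardless of its size. WAR weights each class's recall by its number of examples, which is the same as overall accuracy. A minimal sketch in Python, assuming a hypothetical helper `recall_scores` and toy labels that are not taken from the paper or from IEMOCAP:

```python
from collections import Counter


def recall_scores(y_true, y_pred):
    """Return (UAR, WAR) for two equal-length label sequences.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")

    # Number of reference examples per class.
    support = Counter(y_true)
    # Number of correctly predicted examples per class.
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    # Recall of each class that appears in the reference labels.
    per_class_recall = {c: hits[c] / n for c, n in support.items()}

    # UAR: plain mean of per-class recalls (classes weighted equally).
    uar = sum(per_class_recall.values()) / len(per_class_recall)
    # WAR: recalls weighted by class support, i.e. overall accuracy.
    war = sum(hits.values()) / len(y_true)
    return uar, war


# Toy, imbalanced labels (assumed for illustration only).
y_true = ["angry", "angry", "angry", "angry", "happy", "sad"]
y_pred = ["angry", "angry", "angry", "angry", "sad", "sad"]
uar, war = recall_scores(y_true, y_pred)
print(f"UAR = {uar:.3f}, WAR = {war:.3f}")
```

In this toy case the majority class is always recognised and one minority class is never recognised, so WAR comes out higher than UAR. This gap is why results on class-imbalanced emotion corpora are often reported with both metrics.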

Original language: English
Title of host publication: 2020 International Joint Conference on Neural Networks, IJCNN 2020 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781728169262
State: Published - Jul 2020
Externally published: Yes
Event: 2020 International Joint Conference on Neural Networks, IJCNN 2020 - Virtual, Glasgow, United Kingdom
Duration: 19 Jul 2020 to 24 Jul 2020

Publication series

Name: Proceedings of the International Joint Conference on Neural Networks

Conference

Conference: 2020 International Joint Conference on Neural Networks, IJCNN 2020
Country/Territory: United Kingdom
City: Virtual, Glasgow
Period: 19/07/20 to 24/07/20

Keywords

  • Component-attention
  • Hierarchical attention network
  • Speech emotion recognition
  • Turn embedding
