DEEP SPEAKER CONDITIONING FOR SPEECH EMOTION RECOGNITION

Andreas Triantafyllopoulos, Shuo Liu, Björn W. Schuller

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-reviewed

33 Scopus citations

Abstract

In this work, we explore the use of speaker conditioning sub-networks for speaker adaptation in a deep neural network (DNN) based speech emotion recognition (SER) system. We use a ResNet architecture trained on log spectrogram features, and augment this architecture with an auxiliary network providing speaker embeddings, which conditions multiple layers of the primary classification network on a single neutral speech sample of the target speaker. The whole system is trained end-to-end using a standard cross-entropy loss for utterance-level SER. Relative to the same architecture without the auxiliary embedding sub-network, we are able to improve by 8.3% on IEMOCAP, and by 5.0% and 30.9% on the 2-class and 5-class SER tasks on FAU-AIBO, respectively.
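The abstract describes conditioning multiple layers of a primary classification network on a speaker embedding derived from one neutral sample. A minimal NumPy sketch of one plausible conditioning mechanism is shown below, using a FiLM-style per-channel scale and shift; this is an illustrative assumption, not the paper's exact architecture, and all weights here are random stand-ins for learned parameters.

```python
import numpy as np

# Illustrative sketch only: the paper's exact conditioning mechanism is not
# specified in the abstract; this shows FiLM-style affine conditioning.

def speaker_embedding(neutral_sample, dim=8):
    """Toy stand-in for the auxiliary network: project one neutral
    speech sample of the target speaker to a speaker embedding."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((dim, neutral_sample.shape[0]))  # stand-in for learned weights
    return np.tanh(W @ neutral_sample)

def condition_layer(features, embedding, rng):
    """Condition one layer's feature maps on the speaker embedding
    via a per-channel scale (gamma) and shift (beta)."""
    channels = features.shape[0]
    W_gamma = rng.standard_normal((channels, embedding.shape[0]))
    W_beta = rng.standard_normal((channels, embedding.shape[0]))
    gamma = W_gamma @ embedding  # per-channel scale
    beta = W_beta @ embedding    # per-channel shift
    return features * gamma[:, None] + beta[:, None]

rng = np.random.default_rng(1)
neutral = rng.standard_normal(16)       # one neutral sample of the target speaker
emb = speaker_embedding(neutral)        # embedding from the auxiliary sub-network
feats = rng.standard_normal((4, 10))    # (channels, time) feature maps in the primary net
conditioned = condition_layer(feats, emb, rng)
print(conditioned.shape)  # (4, 10)
```

In the actual system, the same embedding would modulate several ResNet layers, and both networks are trained end-to-end with the cross-entropy SER loss.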

Original language: English
Title of host publication: 2021 IEEE International Conference on Multimedia and Expo, ICME 2021
Publisher: IEEE Computer Society
ISBN (Electronic): 9781665438643
State: Published - 2021
Externally published: Yes
Event: 2021 IEEE International Conference on Multimedia and Expo, ICME 2021 - Shenzhen, China
Duration: 5 Jul 2021 - 9 Jul 2021

Publication series

Name: Proceedings - IEEE International Conference on Multimedia and Expo
ISSN (Print): 1945-7871
ISSN (Electronic): 1945-788X

Conference

Conference: 2021 IEEE International Conference on Multimedia and Expo, ICME 2021
Country/Territory: China
City: Shenzhen
Period: 5/07/21 - 9/07/21

Keywords

  • affective computing
  • speech emotion recognition
