Rethinking Auditory Affective Descriptors Through Zero-Shot Emotion Recognition in Speech

  • Xinzhou Xu
  • Jun Deng
  • Zixing Zhang
  • Xijian Fan
  • Li Zhao
  • Laurence Devillers
  • Bjorn W. Schuller
  • Nanjing University of Posts and Telecommunications
  • University Hospital Augsburg
  • Agile Robots AG
  • Imperial College London
  • Nanjing Forestry University
  • Southeast University
  • Centre de Recherche Institut du Cerveau et de la Moelle

Research output: Contribution to journal › Article › peer-review

15 Scopus citations

Abstract

Zero-shot speech emotion recognition (SER) enables machines to recognize emotional states in speech that were unseen during training, in contrast to conventional SER, which targets supervised cases. To address the zero-shot SER task, auditory affective descriptors (AADs) are typically employed to transfer affective knowledge from seen to unseen emotional states. However, it remains unknown which types of AADs describe emotional states in speech well during this transfer. We therefore define and investigate three types of AADs, namely per-emotion semantic-embedding, per-emotion manually annotated, and per-sample manually annotated AADs, through zero-shot emotion recognition in speech. This leads to a systematic design comprising prototype-based and annotation-based zero-shot SER modules, which take per-emotion and per-sample AADs as input, respectively. We then perform extensive experimental comparisons between human and machine AADs on the French emotional speech corpus CINEMO for positive-negative (PN) and within-negative (WN) tasks. The experimental results indicate that semantic-embedding prototypes from pretrained models can outperform manually annotated emotional dimensions in zero-shot SER. The results further demonstrate that it is possible for machines to understand and describe affective information in speech better than human beings, given sufficient pretrained models.
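
As a rough illustration of how per-emotion descriptors can carry knowledge over to unseen classes, the sketch below assigns an utterance to the unseen emotion whose prototype is closest in descriptor space. It is a generic nearest-prototype rule using cosine similarity, not the method of the paper, which the abstract does not specify. The function names, emotion labels, and all numeric values are hypothetical, and the sketch assumes a separate model has already mapped the utterance into the same space as the prototypes.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def predict_unseen_emotion(projected: Sequence[float],
                           prototypes: Dict[str, Sequence[float]]) -> str:
    """Return the label whose prototype is most similar to the projected sample."""
    if not prototypes:
        raise ValueError("at least one prototype is required")
    return max(prototypes,
               key=lambda label: cosine_similarity(projected, prototypes[label]))


# Hypothetical 3-dimensional prototypes for two emotions unseen in training.
# Real per-emotion descriptors would come from a pretrained embedding model
# or from manual annotation; these numbers are made up for illustration.
toy_prototypes = {
    "emotion_a": [0.9, 0.1, 0.0],
    "emotion_b": [-0.8, 0.2, 0.1],
}

# Hypothetical output of a speech-to-descriptor model for one utterance.
toy_projection = [0.7, 0.3, 0.1]

predicted = predict_unseen_emotion(toy_projection, toy_prototypes)
print(predicted)
```

Because the decision depends only on the prototypes, a new emotion can be added at test time by supplying its descriptor, with no labeled speech for that class.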

Original language: English
Pages (from-to): 1530-1541
Number of pages: 12
Journal: IEEE Transactions on Computational Social Systems
Volume: 9
Issue number: 5
DOIs
State: Published - 1 Oct 2022
Externally published: Yes

Keywords

  • Auditory affective descriptors (AADs)
  • semantic-embedding prototypes
  • speech emotion recognition (SER)
  • zero-shot emotion recognition
