A Residual Multi-Scale Convolutional Neural Network With Transformers for Speech Emotion Recognition

  • Tianhao Yan
  • Hao Meng
  • Emilia Parada-Cabaleiro
  • Jianhua Tao
  • Taihao Li
  • Bjorn W. Schuller
  • Zhejiang Lab
  • Harbin Engineering University
  • Johannes Kepler University Linz
  • Tsinghua University
  • University of Chinese Academy of Sciences
  • University Hospital Augsburg
  • Imperial College London

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

The great variety of human emotional expression, together with differences in how people perceive and annotate it, makes Speech Emotion Recognition (SER) an ambiguous and challenging task. With the development of deep learning, substantial progress has been made in SER systems. However, existing convolutional neural networks have certain limitations, such as their inability to capture global features well, even though these features contain important emotional information. Moreover, the position encoding in the Transformer structure is relatively fixed and encodes only the time-domain dimension, so it cannot effectively obtain the position information of discriminative features in the frequency-domain dimension. To overcome these limitations, we propose an end-to-end Residual Multi-Scale Convolutional Neural Network (RMSCNN) with Transformer model. In addition, to further validate the effectiveness of RMSCNN in extracting multi-scale features and delivering pertinent emotion localization data, we developed the RMSC_down network in conjunction with the Wav2Vec 2.0 model. The results for the prediction of Arousal, Valence, and Dominance on popular corpora demonstrate the superiority and robustness of our approach for SER, showing an improvement in recognition accuracy on the public dataset MSP-Podcast, version 1.9.

Original language: English
Pages (from-to): 915-932
Number of pages: 18
Journal: IEEE Transactions on Affective Computing
Volume: 16
Issue number: 2
DOIs
State: Published - 2025
Externally published: Yes

Keywords

  • Speech emotion recognition (SER)
  • adaptive position encoding
  • attention mechanism
  • residual multi-scale CNNs
