A Residual Multi-Scale Convolutional Neural Network with Transformers for Speech Emotion Recognition

Tianhao Yan, Hao Meng, Emilia Parada-Cabaleiro, Jianhua Tao, Taihao Li, Bjorn W. Schuller

Research output: Contribution to journal › Article › peer-review

Abstract

The great variety of human emotional expression, together with differences in how people perceive and annotate it, makes Speech Emotion Recognition (SER) an ambiguous and challenging task. With the development of deep learning, considerable progress has been made in SER systems. However, existing convolutional neural networks have certain limitations, such as an inability to adequately capture features that contain important emotional information. Moreover, the position encoding in the Transformer structure is relatively fixed and encodes only the time dimension, so it cannot effectively capture the position of discriminative features along the frequency dimension. To overcome these limitations, we propose an end-to-end Residual Multi-Scale Convolutional Neural Network (RMSCNN) with Transformer network. In addition, to further validate the effectiveness of RMSCNN in extracting multi-scale features and delivering pertinent emotion localization information, we developed the RMSC_down network in conjunction with the Wav2Vec 2.0 model. The prediction results for Arousal, Valence, and Dominance on popular corpora demonstrate the superiority and robustness of our approach for SER, including improved recognition accuracy on the public MSP-Podcast dataset (version 1.9). The code is available on GitHub.
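
The position-encoding limitation described in the abstract can be illustrated with a small sketch. It is not the authors' code: it assumes the "relatively fixed" encoding is the standard sinusoidal one, and the function names and toy tensor shape (time frames × frequency bins × channels) are hypothetical. It only shows that such an encoding changes along the time axis while adding the same vector at every frequency bin of a frame.

```python
import math


def sinusoidal_position_encoding(num_frames: int, d_model: int) -> list[list[float]]:
    """Fixed sinusoidal encoding: one d_model-sized vector per time frame."""
    encoding = []
    for t in range(num_frames):
        row = []
        for i in range(d_model):
            # Each sin/cos pair of channels shares one wavelength.
            angle = t / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        encoding.append(row)
    return encoding


def add_time_encoding(features, encoding):
    """features[t][f][c]: time frame t, frequency bin f, channel c (assumed layout).

    The same vector encoding[t] is added at every frequency bin f,
    so the encoding carries no information about f.
    """
    return [
        [[x + p for x, p in zip(channels, encoding[t])] for channels in frame]
        for t, frame in enumerate(features)
    ]


# Toy input: 3 time frames, 2 frequency bins, 4 channels, all zeros.
features = [[[0.0] * 4 for _ in range(2)] for _ in range(3)]
encoded = add_time_encoding(features, sinusoidal_position_encoding(3, 4))

# Frames differ from each other, but bins within a frame do not.
print(encoded[0][0] != encoded[1][0])
print(all(encoded[t][0] == encoded[t][1] for t in range(3)))
```

With an all-zero input, the two frequency bins of each frame end up identical after encoding, which is the gap an encoding aware of the frequency dimension would need to close.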

Original language: English
Journal: IEEE Transactions on Affective Computing
State: Accepted/In press - 2024
Externally published: Yes

Keywords

  • Adaptive Position Encoding
  • Attention Mechanism
  • Residual Multi-Scale CNNs
  • Speech Emotion Recognition
