TY - JOUR
T1 - A Residual Multi-Scale Convolutional Neural Network with Transformers for Speech Emotion Recognition
AU - Yan, Tianhao
AU - Meng, Hao
AU - Parada-Cabaleiro, Emilia
AU - Tao, Jianhua
AU - Li, Taihao
AU - Schuller, Björn W.
N1 - Publisher Copyright:
© 2010-2012 IEEE.
PY - 2024
Y1 - 2024
N2 - The great variety of human emotional expression, as well as the differences in the ways listeners perceive and annotate emotions, makes Speech Emotion Recognition (SER) an ambiguous and challenging task. With the development of deep learning, considerable progress has been made in SER systems. However, existing convolutional neural networks present certain limitations, such as their inability to effectively capture multi-scale features that carry important emotional information. Moreover, the position encoding in the Transformer structure is fixed and encodes only the time-domain dimension, so it cannot effectively capture the positions of discriminative features along the frequency-domain dimension. To overcome these limitations, we propose an end-to-end Residual Multi-Scale Convolutional Neural Network (RMSCNN) with Transformer network. In addition, to further validate the effectiveness of RMSCNN in extracting multi-scale features and delivering pertinent emotion-localization information, we developed the RMSC_down network in conjunction with the Wav2Vec 2.0 model. The results of predicting Arousal, Valence, and Dominance on popular corpora demonstrate the superiority and robustness of our approach for SER, showing an improvement in recognition accuracy on version 1.9 of the public MSP-Podcast dataset. The code is available on GitHub.
AB - The great variety of human emotional expression, as well as the differences in the ways listeners perceive and annotate emotions, makes Speech Emotion Recognition (SER) an ambiguous and challenging task. With the development of deep learning, considerable progress has been made in SER systems. However, existing convolutional neural networks present certain limitations, such as their inability to effectively capture multi-scale features that carry important emotional information. Moreover, the position encoding in the Transformer structure is fixed and encodes only the time-domain dimension, so it cannot effectively capture the positions of discriminative features along the frequency-domain dimension. To overcome these limitations, we propose an end-to-end Residual Multi-Scale Convolutional Neural Network (RMSCNN) with Transformer network. In addition, to further validate the effectiveness of RMSCNN in extracting multi-scale features and delivering pertinent emotion-localization information, we developed the RMSC_down network in conjunction with the Wav2Vec 2.0 model. The results of predicting Arousal, Valence, and Dominance on popular corpora demonstrate the superiority and robustness of our approach for SER, showing an improvement in recognition accuracy on version 1.9 of the public MSP-Podcast dataset. The code is available on GitHub.
KW - Adaptive Position Encoding
KW - Attention Mechanism
KW - Residual Multi-Scale CNNs
KW - Speech Emotion Recognition
UR - http://www.scopus.com/inward/record.url?scp=85207335814&partnerID=8YFLogxK
U2 - 10.1109/TAFFC.2024.3481253
DO - 10.1109/TAFFC.2024.3481253
M3 - Article
AN - SCOPUS:85207335814
SN - 1949-3045
JO - IEEE Transactions on Affective Computing
JF - IEEE Transactions on Affective Computing
ER -