TY - GEN
T1 - Speech emotion recognition using semantic information
AU - Tzirakis, Panagiotis
AU - Nguyen, Anh
AU - Zafeiriou, Stefanos
AU - Schuller, Björn W.
N1 - Publisher Copyright:
© 2021 IEEE
PY - 2021
Y1 - 2021
AB - Speech emotion recognition is a crucial problem manifesting in a multitude of applications such as human-computer interaction and education. Although several advancements have been made in recent years, especially with the advent of Deep Neural Networks (DNNs), most studies in the literature fail to consider the semantic information in the speech signal. In this paper, we propose a novel framework that can capture both the semantic and the paralinguistic information in the signal. In particular, our framework comprises a semantic feature extractor that captures the semantic information, and a paralinguistic feature extractor that captures the paralinguistic information. Both semantic and paralinguistic features are then combined into a unified representation using a novel attention mechanism. The unified feature vector is passed through an LSTM to capture the temporal dynamics in the signal before the final prediction. To validate the effectiveness of our framework, we use the popular SEWA dataset of the AVEC challenge series and compare with the three winning papers. Our model provides state-of-the-art results in the valence and liking dimensions.
KW - Audiotextual information
KW - Deep learning
KW - Emotion recognition
KW - Paralinguistic
KW - Semantic
UR - http://www.scopus.com/inward/record.url?scp=85115071289&partnerID=8YFLogxK
DO - 10.1109/ICASSP39728.2021.9414866
M3 - Conference contribution
AN - SCOPUS:85115071289
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 6279
EP - 6283
BT - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021
Y2 - 6 June 2021 through 11 June 2021
ER -