TY - JOUR
T1 - End-to-end multimodal affect recognition in real-world environments
AU - Tzirakis, Panagiotis
AU - Chen, Jiaxin
AU - Zafeiriou, Stefanos
AU - Schuller, Björn
N1 - Publisher Copyright:
© 2020 Elsevier B.V.
PY - 2021/4
Y1 - 2021/4
N2 - Automatic affect recognition in real-world environments is an important task towards natural interaction between humans and machines. In recent years, several advances have been made in determining emotional states with the use of Deep Neural Networks (DNNs). In this paper, we propose an emotion recognition system that utilizes raw text, audio, and visual information in an end-to-end manner. To capture the emotional states of a person, robust features need to be extracted from the various modalities. To this end, we utilize Convolutional Neural Networks (CNNs) and propose a novel transformer-based architecture for the text modality that can robustly capture the semantics of sentences. We develop an audio model to process the audio channel, and adopt a variant of the High-Resolution Network (HRNet) to process the visual modality. To fuse the modality-specific features, we propose novel attention-based methods. To capture the temporal dynamics in the signal, we utilize Long Short-Term Memory (LSTM) networks. Our model is trained on the SEWA dataset of the AVEC 2017 research sub-challenge on emotion recognition, and produces state-of-the-art results in the text, visual, and multimodal domains, as well as comparable performance in the audio case, when compared with the winning papers of the challenge, which use several hand-crafted and DNN features. Code is available at: https://github.com/glam-imperial/multimodal-affect-recognition.
AB - Automatic affect recognition in real-world environments is an important task towards natural interaction between humans and machines. In recent years, several advances have been made in determining emotional states with the use of Deep Neural Networks (DNNs). In this paper, we propose an emotion recognition system that utilizes raw text, audio, and visual information in an end-to-end manner. To capture the emotional states of a person, robust features need to be extracted from the various modalities. To this end, we utilize Convolutional Neural Networks (CNNs) and propose a novel transformer-based architecture for the text modality that can robustly capture the semantics of sentences. We develop an audio model to process the audio channel, and adopt a variant of the High-Resolution Network (HRNet) to process the visual modality. To fuse the modality-specific features, we propose novel attention-based methods. To capture the temporal dynamics in the signal, we utilize Long Short-Term Memory (LSTM) networks. Our model is trained on the SEWA dataset of the AVEC 2017 research sub-challenge on emotion recognition, and produces state-of-the-art results in the text, visual, and multimodal domains, as well as comparable performance in the audio case, when compared with the winning papers of the challenge, which use several hand-crafted and DNN features. Code is available at: https://github.com/glam-imperial/multimodal-affect-recognition.
KW - Deep learning
KW - Emotion recognition
KW - Multimodal machine learning
KW - Sentiment analysis
UR - http://www.scopus.com/inward/record.url?scp=85095916539&partnerID=8YFLogxK
U2 - 10.1016/j.inffus.2020.10.011
DO - 10.1016/j.inffus.2020.10.011
M3 - Article
AN - SCOPUS:85095916539
SN - 1566-2535
VL - 68
SP - 46
EP - 53
JO - Information Fusion
JF - Information Fusion
ER -
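
Implementation note: the following is a minimal, illustrative sketch of the fusion stage described in the abstract, not the authors' implementation (the official code is at the repository linked above). It assumes PyTorch, assumes that each modality-specific encoder (text transformer, audio CNN, visual HRNet variant) already produces a sequence of feature vectors, and uses hypothetical module and dimension names. Attention weights over the modalities fuse the per-frame features, and an LSTM models the temporal dynamics before regressing continuous affect dimensions (e.g. arousal and valence).

# Minimal sketch only; module names, dimensions, and output targets are assumptions,
# not taken from the paper or the linked repository.
import torch
import torch.nn as nn


class AttentionFusionLSTM(nn.Module):
    def __init__(self, feat_dim=256, hidden_dim=128, num_outputs=2):
        super().__init__()
        # One scalar attention score per modality, computed from its feature vector.
        self.score = nn.Linear(feat_dim, 1)
        # LSTM over time to capture the temporal dynamics of the fused representation.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Regress continuous affect dimensions (here assumed to be arousal and valence).
        self.head = nn.Linear(hidden_dim, num_outputs)

    def forward(self, text_feat, audio_feat, visual_feat):
        # Each input: (batch, time, feat_dim) features from a modality-specific encoder.
        feats = torch.stack([text_feat, audio_feat, visual_feat], dim=2)  # (B, T, 3, D)
        scores = self.score(feats)                        # (B, T, 3, 1)
        weights = torch.softmax(scores, dim=2)            # attention over modalities
        fused = (weights * feats).sum(dim=2)              # (B, T, D)
        out, _ = self.lstm(fused)                         # (B, T, H)
        return self.head(out)                             # (B, T, num_outputs)


if __name__ == "__main__":
    B, T, D = 4, 100, 256
    model = AttentionFusionLSTM(feat_dim=D)
    t, a, v = (torch.randn(B, T, D) for _ in range(3))
    preds = model(t, a, v)   # per-frame affect predictions
    print(preds.shape)       # torch.Size([4, 100, 2])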