TY - JOUR
T1 - Emotion recognition in live broadcasting
T2 - a multimodal deep learning framework
AU - Abbas, Rizwan
AU - Schuller, Björn W.
AU - Li, Xuewei
AU - Lin, Chi
AU - Li, Xi
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2025.
PY - 2025/6
Y1 - 2025/6
AB - Multimodal emotion recognition is a rapidly developing field with applications across diverse domains such as entertainment, healthcare, marketing, and education. The emergence of live broadcasting demands real-time emotion recognition, which involves analyzing emotions via body language, voice, facial expressions, and context. Previous studies of multimodal emotion recognition in live broadcasting have faced challenges such as limited computational efficiency, noisy and incomplete data, and difficult camera angles. This research presents a Multimodal Emotion Recognition in Live Broadcasting (MERLB) system that collects the speech, facial expressions, and context displayed in live broadcasts for emotion recognition. We utilize a deep convolutional neural network architecture for facial emotion recognition, incorporating inception modules and dense blocks. We aim to enhance computational efficiency by focusing on key segments rather than analyzing the entire utterance. MERLB employs tensor train layers to combine multimodal representations at higher orders. Experiments were conducted on the FIFA, League of Legends, IEMOCAP, and CMU-MOSEI datasets. MERLB achieves a 6.44% F1 score improvement on the FIFA dataset and 4.71% on League of Legends, and outperforms other multimodal emotion recognition methods on the IEMOCAP and CMU-MOSEI datasets. Our code is available at https://github.com/swerizwan/merlb.
KW - Facial expressions
KW - Multimodal emotion recognition
KW - Speech emotion
KW - Tensor train layers
UR - http://www.scopus.com/inward/record.url?scp=105005262208&partnerID=8YFLogxK
U2 - 10.1007/s00530-025-01780-y
DO - 10.1007/s00530-025-01780-y
M3 - Article
AN - SCOPUS:105005262208
SN - 0942-4962
VL - 31
JO - Multimedia Systems
JF - Multimedia Systems
IS - 3
M1 - 253
ER -