TY - GEN
T1 - Implicit Fusion by Joint Audiovisual Training for Emotion Recognition in Mono Modality
AU - Han, Jing
AU - Zhang, Zixing
AU - Ren, Zhao
AU - Schuller, Björn
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/5
Y1 - 2019/5
N2 - Despite significant advances in emotion recognition from a single modality, previous studies fail to take advantage of other modalities when training models for mono-modal scenarios. In this work, we propose a novel joint training model that implicitly fuses audio and visual information during training for either speech or facial emotion recognition. Specifically, the model consists of one modality-specific network per modality and one shared network that maps both audio and visual cues into final predictions. During training, we additionally take into account the loss from an auxiliary modality besides that of the main modality. To evaluate the effectiveness of the implicit fusion model, we conduct extensive experiments on mono-modal emotion classification and regression, and find that the implicit fusion models outperform the standard mono-modal training process.
AB - Despite significant advances in emotion recognition from a single modality, previous studies fail to take advantage of other modalities when training models for mono-modal scenarios. In this work, we propose a novel joint training model that implicitly fuses audio and visual information during training for either speech or facial emotion recognition. Specifically, the model consists of one modality-specific network per modality and one shared network that maps both audio and visual cues into final predictions. During training, we additionally take into account the loss from an auxiliary modality besides that of the main modality. To evaluate the effectiveness of the implicit fusion model, we conduct extensive experiments on mono-modal emotion classification and regression, and find that the implicit fusion models outperform the standard mono-modal training process.
KW - Joint training
KW - audiovisual learning
KW - emotion classification
KW - emotion regression
UR - http://www.scopus.com/inward/record.url?scp=85068984985&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2019.8682773
DO - 10.1109/ICASSP.2019.8682773
M3 - Conference contribution
AN - SCOPUS:85068984985
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 5861
EP - 5865
BT - 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019
Y2 - 12 May 2019 through 17 May 2019
ER -