TY - GEN
T1 - Time-Continuous Audiovisual Fusion with Recurrence vs Attention for In-The-Wild Affect Recognition
AU - Karas, Vincent
AU - Tellamekala, Mani Kumar
AU - Mallol-Ragolta, Adrià
AU - Valstar, Michel
AU - Schuller, Björn W.
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - This paper presents our contribution to the 3rd Affective Behavior Analysis in-the-Wild (ABAW) challenge. Exploiting the complementarity among multimodal data streams is of vital importance for recognising dimensional affect from in-the-wild audiovisual data, as the affect-wise contribution of the involved modalities may change over time. Recurrence and attention are two of the most widely used modelling mechanisms in the literature for capturing the temporal dependencies of audiovisual data sequences. To clearly understand the performance differences between recurrent and attention models in audiovisual affect recognition, we present a comprehensive evaluation of fusion models based on LSTM-RNNs, self-attention, and cross-modal attention, trained for valence and arousal estimation. In particular, we study the impact of key design choices: the modelling complexity of the CNN backbones that provide features to the temporal models, with and without end-to-end learning. We train the audiovisual affect recognition models on the in-the-wild Aff-wild2 corpus, systematically tuning the hyper-parameters involved in the network architecture design and training optimisation. Our extensive evaluation of the audiovisual fusion models indicates that, under various experimental settings, attention models may not necessarily be the optimal choice over RNNs for time-continuous multimodal fusion for emotion recognition.
AB - This paper presents our contribution to the 3rd Affective Behavior Analysis in-the-Wild (ABAW) challenge. Exploiting the complementarity among multimodal data streams is of vital importance for recognising dimensional affect from in-the-wild audiovisual data, as the affect-wise contribution of the involved modalities may change over time. Recurrence and attention are two of the most widely used modelling mechanisms in the literature for capturing the temporal dependencies of audiovisual data sequences. To clearly understand the performance differences between recurrent and attention models in audiovisual affect recognition, we present a comprehensive evaluation of fusion models based on LSTM-RNNs, self-attention, and cross-modal attention, trained for valence and arousal estimation. In particular, we study the impact of key design choices: the modelling complexity of the CNN backbones that provide features to the temporal models, with and without end-to-end learning. We train the audiovisual affect recognition models on the in-the-wild Aff-wild2 corpus, systematically tuning the hyper-parameters involved in the network architecture design and training optimisation. Our extensive evaluation of the audiovisual fusion models indicates that, under various experimental settings, attention models may not necessarily be the optimal choice over RNNs for time-continuous multimodal fusion for emotion recognition.
UR - https://www.scopus.com/pages/publications/85137807235
U2 - 10.1109/CVPRW56347.2022.00266
DO - 10.1109/CVPRW56347.2022.00266
M3 - Conference contribution
AN - SCOPUS:85137807235
T3 - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
SP - 2381
EP - 2390
BT - Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2022
PB - IEEE Computer Society
T2 - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2022
Y2 - 19 June 2022 through 20 June 2022
ER -