TY - GEN
T1 - From Speech to Facial Activity
T2 - IEEE 21st International Workshop on Multimedia Signal Processing, MMSP 2019
AU - Stappen, Lukas
AU - Karas, Vincent
AU - Cummins, Nicholas
AU - Ringeval, Fabien
AU - Scherer, Klaus
AU - Schuller, Björn
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/9
Y1 - 2019/9
N2 - Multimodal data sources offer the possibility to capture and model interactions between modalities, leading to an improved understanding of underlying relationships. In this regard, the work presented in this paper explores the relationship between facial muscle movements and speech signals. Specifically, we explore the efficacy of different sequence-to-sequence neural network architectures for the task of predicting Facial Action Coding System Action Units (AUs) from one of two acoustic feature representations extracted from speech signals, namely the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) or the Interspeech Computational Paralinguistics Challenge feature set (ComParE). Furthermore, these architectures were enhanced by two different attention mechanisms (intra- and inter-attention) and various state-of-the-art network settings to improve prediction performance. Results indicate that a sequence-to-sequence model with inter-attention can achieve on average an Unweighted Average Recall (UAR) of 65.9 % for AU onset, 67.8 % for AU apex (both eGeMAPS), 79.7 % for AU offset and 65.3 % for AU occurrence (both ComParE) detection over all AUs.
AB - Multimodal data sources offer the possibility to capture and model interactions between modalities, leading to an improved understanding of underlying relationships. In this regard, the work presented in this paper explores the relationship between facial muscle movements and speech signals. Specifically, we explore the efficacy of different sequence-to-sequence neural network architectures for the task of predicting Facial Action Coding System Action Units (AUs) from one of two acoustic feature representations extracted from speech signals, namely the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) or the Interspeech Computational Paralinguistics Challenge feature set (ComParE). Furthermore, these architectures were enhanced by two different attention mechanisms (intra- and inter-attention) and various state-of-the-art network settings to improve prediction performance. Results indicate that a sequence-to-sequence model with inter-attention can achieve on average an Unweighted Average Recall (UAR) of 65.9 % for AU onset, 67.8 % for AU apex (both eGeMAPS), 79.7 % for AU offset and 65.3 % for AU occurrence (both ComParE) detection over all AUs.
KW - attention networks
KW - facial action units
KW - paralinguistics
KW - sequence to sequence
UR - http://www.scopus.com/inward/record.url?scp=85075713927&partnerID=8YFLogxK
U2 - 10.1109/MMSP.2019.8901779
DO - 10.1109/MMSP.2019.8901779
M3 - Conference contribution
AN - SCOPUS:85075713927
T3 - IEEE 21st International Workshop on Multimedia Signal Processing, MMSP 2019
BT - IEEE 21st International Workshop on Multimedia Signal Processing, MMSP 2019
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 27 September 2019 through 29 September 2019
ER -