TY - GEN
T1 - Synthesising 3D Facial Motion from 'In-the-Wild' Speech
AU - Tzirakis, Panagiotis
AU - Papaioannou, Athanasios
AU - Lattas, Alexandros
AU - Tarasiou, Michail
AU - Schuller, Björn
AU - Zafeiriou, Stefanos
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/11
Y1 - 2020/11
N2 - Synthesising 3D facial motion from speech is a crucial problem manifesting in a multitude of applications such as computer games and movies. Recently proposed methods tackle this problem under controlled recording conditions. In this paper, we introduce the first methodology for 3D facial motion synthesis from speech captured in arbitrary recording conditions ('in-the-wild') and independent of the speaker. For our purposes, we captured 4D sequences of people uttering the 500 words contained in the Lip Reading in the Wild (LRW) dataset, a publicly available large-scale in-the-wild corpus, and built a set of 3D blendshapes appropriate for speech. We correlate the 3D shape parameters of the speech blendshapes to the LRW audio samples by means of a novel time-warping technique, named Deep Canonical Attentional Warping (DCAW), that can simultaneously learn hierarchical non-linear representations and a warping path in an end-to-end manner. We thoroughly evaluate our proposed methods and show the ability of a deep learning model to synthesise 3D facial motion while handling different speakers and continuous speech signals in uncontrolled conditions.
AB - Synthesising 3D facial motion from speech is a crucial problem manifesting in a multitude of applications such as computer games and movies. Recently proposed methods tackle this problem under controlled recording conditions. In this paper, we introduce the first methodology for 3D facial motion synthesis from speech captured in arbitrary recording conditions ('in-the-wild') and independent of the speaker. For our purposes, we captured 4D sequences of people uttering the 500 words contained in the Lip Reading in the Wild (LRW) dataset, a publicly available large-scale in-the-wild corpus, and built a set of 3D blendshapes appropriate for speech. We correlate the 3D shape parameters of the speech blendshapes to the LRW audio samples by means of a novel time-warping technique, named Deep Canonical Attentional Warping (DCAW), that can simultaneously learn hierarchical non-linear representations and a warping path in an end-to-end manner. We thoroughly evaluate our proposed methods and show the ability of a deep learning model to synthesise 3D facial motion while handling different speakers and continuous speech signals in uncontrolled conditions.
UR - http://www.scopus.com/inward/record.url?scp=85101432586&partnerID=8YFLogxK
U2 - 10.1109/FG47880.2020.00100
DO - 10.1109/FG47880.2020.00100
M3 - Conference contribution
AN - SCOPUS:85101432586
T3 - Proceedings - 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2020
SP - 265
EP - 272
BT - Proceedings - 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2020
A2 - Struc, Vitomir
A2 - Gomez-Fernandez, Francisco
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 15th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2020
Y2 - 16 November 2020 through 20 November 2020
ER -