TY - GEN
T1 - LiRA: Learning Visual Speech Representations from Audio via Self-supervision
T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
AU - Ma, Pingchuan
AU - Mira, Rodrigo
AU - Petridis, Stavros
AU - Schuller, Björn W.
AU - Pantic, Maja
N1 - Publisher Copyright:
Copyright © 2021 ISCA.
PY - 2021
Y1 - 2021
AB - The large amount of audiovisual content being shared online today has drawn substantial attention to the prospect of audiovisual self-supervised learning. Recent works have focused on each of these modalities separately, while others have attempted to model both simultaneously in a cross-modal fashion. However, comparatively little attention has been given to leveraging one modality as a training objective to learn from the other. In this work, we propose Learning visual speech Representations from Audio via self-supervision (LiRA). Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech. We find that this pre-trained model can be leveraged towards word-level and sentence-level lip-reading through feature extraction and fine-tuning experiments. We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild (LRW) dataset and achieves state-of-the-art performance on Lip Reading Sentences 2 (LRS2) using only a fraction of the total labelled data.
KW - Conformer
KW - Lip-reading
KW - Self-supervised learning
KW - Visual representations
KW - Visual speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85119181327&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2021-1360
DO - 10.21437/Interspeech.2021-1360
M3 - Conference contribution
AN - SCOPUS:85119181327
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 1241
EP - 1245
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PB - International Speech Communication Association
Y2 - 30 August 2021 through 3 September 2021
ER -