Attention Fusion for Audio-Visual Person Verification Using Multi-Scale Features

Stefan Hormann, Abdul Moiz, Martin Knoche, Gerhard Rigoll

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

10 Scopus citations

Abstract

In the domain of audio-visual person recognition, many approaches use naive fusion techniques, such as score-level fusion or concatenation, to fuse the features obtained by face and audio extraction networks. More sophisticated methods fuse both features while taking into account the quality of their corresponding inputs. In this paper, we propose a novel architecture to improve the prediction of feature quality. In contrast to previous works, which estimate feature quality based on the features themselves, we combine the information obtained from different layers of the feature extraction networks. In our analysis, we show that our approach outperforms state-of-the-art fusion approaches on well-established benchmarks for multimodal person verification. Moreover, we show that our model is robust against degradation of the visual input.
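To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' architecture. It compares plain concatenation with a quality-weighted (attention-style) fusion of a face and a voice embedding, and shows one hypothetical way to build a multi-layer descriptor by average-pooling intermediate feature maps. All function names, the softmax weighting, and the assumption that both embeddings share one dimension are illustrative choices that do not come from the paper.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D sequence of scores."""
    scores = np.asarray(scores, dtype=float)
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()


def concat_fusion(face_emb, voice_emb):
    """Naive fusion: stack both embeddings, ignoring input quality."""
    return np.concatenate([face_emb, voice_emb])


def attention_fusion(face_emb, voice_emb, face_quality, voice_quality):
    """Quality-aware fusion (illustrative assumption).

    The two scalar quality scores are turned into weights that sum to 1,
    and the fused embedding is the weighted sum of both modalities.
    Both embeddings are assumed to have the same dimension.
    """
    weights = softmax([face_quality, voice_quality])
    fused = weights[0] * face_emb + weights[1] * voice_emb
    return fused, weights


def pooled_multiscale_descriptor(feature_maps):
    """Hypothetical multi-scale descriptor.

    Each feature map has shape (channels, height, width). Every map is
    global-average-pooled to a (channels,) vector and the results are
    concatenated, so information from several layers can feed a quality
    estimator instead of the final embedding alone.
    """
    return np.concatenate([fm.mean(axis=(1, 2)) for fm in feature_maps])
```

In this sketch, a degraded face image would ideally receive a low `face_quality`, which shifts the fused embedding toward the voice embedding. How the quality scores are actually predicted is left open here.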

Original language: English
Title of host publication: Proceedings - 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2020
Editors: Vitomir Struc, Francisco Gomez-Fernandez
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 281-285
Number of pages: 5
ISBN (Electronic): 9781728130798
State: Published - Nov 2020
Event: 15th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2020 - Buenos Aires, Argentina
Duration: 16 Nov 2020 to 20 Nov 2020

Publication series

Name: Proceedings - 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2020

Conference

Conference: 15th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2020
Country/Territory: Argentina
City: Buenos Aires
Period: 16/11/20 to 20/11/20

Keywords

  • attention
  • audio visual
  • face recognition
  • fusion
  • multimodal
  • person verification
