Low-level fusion of audio and video feature for multi-modal emotion recognition

Matthias Wimmer, Björn Schuller, Dejan Arsic, Gerhard Rigoll, Bernd Radig

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

50 Scopus citations

Abstract

Bimodal emotion recognition through audiovisual feature fusion has been shown in the past to outperform each individual modality. Still, synchronizing the two streams is a challenge, because many vision approaches work on a frame basis, whereas audio is analyzed on a turn or chunk basis. Therefore, late fusion schemes such as simple logic or voting strategies are commonly used for the overall estimation of the underlying affect. However, early fusion is known to be more effective in many other multimodal recognition tasks. We therefore suggest a combined analysis of audio and video Low-Level-Descriptors by descriptive statistics, followed by static SVM classification. This strategy also allows for a combined feature-space optimization, which is discussed herein. The high effectiveness of this approach is shown on an 11.5 h database containing six emotional situations in an airplane scenario.
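The abstract describes the pipeline only at a high level: frame-level audio and video Low-Level-Descriptors (LLDs) are each summarized over a turn by descriptive statistics, the resulting vectors are concatenated, and a static SVM classifies the fused vector. The following is a minimal sketch of that idea under stated assumptions, not the authors' implementation. The functional set (mean, standard deviation, min, max, range), the synthetic two-class data, the labels `calm`/`agitated`, and all function names are hypothetical. To stay dependency-free, a linear SVM trained with Pegasos-style subgradient descent stands in for whatever SVM implementation and kernel the paper used, and the combined feature-space optimization mentioned in the abstract is not shown.

```python
import random
from statistics import mean, pstdev

# Hypothetical functional set; the abstract only says "descriptive statistics".
FUNCTIONALS = [
    mean,
    pstdev,
    min,
    max,
    lambda xs: max(xs) - min(xs),  # range
]


def turn_functionals(lld_frames):
    """Reduce a (num_frames x num_llds) sequence to a fixed-length vector
    holding every functional of every LLD channel."""
    channels = [list(ch) for ch in zip(*lld_frames)]  # one list per LLD
    return [f(ch) for ch in channels for f in FUNCTIONALS]


def fuse_turn(audio_frames, video_frames):
    """Low-level (early) fusion: concatenate the per-modality statistics.
    Audio and video may have different frame rates and frame counts,
    because both are summarized over the same turn before fusion."""
    return turn_functionals(audio_frames) + turn_functionals(video_frames)


def standardize(X):
    """Z-normalize each fused feature so audio and video scales are comparable."""
    cols = list(zip(*X))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant features
    Xs = [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in X]
    return Xs, mus, sds


def train_linear_svm(X, y, lam=0.01, epochs=100, seed=0):
    """Linear SVM (labels +1/-1) trained with Pegasos-style subgradient steps
    on the L2-regularized hinge loss; a constant 1 feature acts as the bias."""
    rng = random.Random(seed)
    Xb = [row + [1.0] for row in X]
    w = [0.0] * len(Xb[0])
    order = list(range(len(Xb)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, Xb[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]  # shrink (L2 regularizer)
            if margin < 1.0:  # hinge loss is active for this sample
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, Xb[i])]
    return w


def train_one_vs_rest(X, labels, **kw):
    """One binary SVM per emotion class."""
    return {
        c: train_linear_svm(X, [1 if lab == c else -1 for lab in labels], **kw)
        for c in sorted(set(labels))
    }


def predict(models, x):
    """Pick the class whose SVM gives the largest decision value."""
    xb = x + [1.0]
    return max(models, key=lambda c: sum(w * v for w, v in zip(models[c], xb)))


def synthetic_turn(rng, offset):
    """Hypothetical turn: 2 audio LLDs at a higher frame rate than 2 video LLDs."""
    n_audio, n_video = rng.randint(80, 120), rng.randint(20, 30)
    audio = [[offset + rng.gauss(0, 0.5) for _ in range(2)] for _ in range(n_audio)]
    video = [[offset + rng.gauss(0, 0.5) for _ in range(2)] for _ in range(n_video)]
    return audio, video


def demo(seed=42):
    """Train and score on synthetic turns; returns training accuracy."""
    rng = random.Random(seed)
    X, y = [], []
    for label, offset in [("calm", 0.0), ("agitated", 3.0)]:
        for _ in range(20):
            audio, video = synthetic_turn(rng, offset)
            X.append(fuse_turn(audio, video))
            y.append(label)
    Xs, _, _ = standardize(X)
    models = train_one_vs_rest(Xs, y)
    return sum(predict(models, x) == lab for x, lab in zip(Xs, y)) / len(y)


print(f"training accuracy: {demo():.2f}")
```

Because each modality is reduced to turn-level statistics before concatenation, the audio and video streams never have to be aligned frame by frame, which is the synchronization problem the abstract raises. In this sketch, a joint feature selection step would operate on the vector returned by `fuse_turn`, so audio and video features are selected together rather than per modality.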

Original language: English
Title of host publication: VISAPP 2008 - 3rd International Conference on Computer Vision Theory and Applications, Proceedings
Pages: 145-151
Number of pages: 7
State: Published - 2008
Event: 3rd International Conference on Computer Vision Theory and Applications, VISAPP 2008 - Funchal, Madeira, Portugal
Duration: 22 Jan 2008 to 25 Jan 2008

Publication series

Name: VISAPP 2008 - 3rd International Conference on Computer Vision Theory and Applications, Proceedings
Volume: 2

Conference

Conference: 3rd International Conference on Computer Vision Theory and Applications, VISAPP 2008
Country/Territory: Portugal
City: Funchal, Madeira
Period: 22/01/08 to 25/01/08

Keywords

  • Audio-visual processing
  • Emotion recognition
  • Multi-modal fusion
