MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

Ziping Zhao, Tian Gao, Haishuai Wang, Björn Schuller

Research output: Contribution to journal › Conference article › peer-review

Abstract

Emotion recognition in conversation should not rely solely on spotting emotion keywords; it also requires comprehensive judgments that take the surrounding context into account. To this end, we propose MFDR, which efficiently integrates acoustic and textual information. Specifically, acoustic-word combination and context perception are modeled sequentially, in stages, through the Sliding Adaptive Window Attention (SAWA) and a Gated Context Perception Unit. More importantly, without additional memory overhead, SAWA allows the perception range to be adjusted adaptively according to correlation strength, resolving the misalignment and information loss caused by window truncation and modeling fusion at variable granularity. Furthermore, emotion refinement through Dynamic Frame Convolution strips out emotion-irrelevant frames, yielding a compact and emotionally discriminative fused representation. The efficacy of MFDR is confirmed on IEMOCAP and CMU-MOSEI, where it demonstrates promising performance.
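For intuition only, below is a minimal PyTorch sketch of how an adaptive sliding-window fusion of this kind could look: each word attends over a local window of acoustic frames whose half-width grows with the word-frame correlation strength. The linear word-to-frame alignment, the sigmoid widening rule, and all names (sliding_adaptive_window_attention, base_half, max_extra) are assumptions made for illustration, not the paper's exact SAWA formulation.

```python
import torch
import torch.nn.functional as F


def sliding_adaptive_window_attention(text, audio, base_half=4, max_extra=8):
    """Fuse each word vector with a local window of acoustic frames.

    text:  (T, d) word-level textual features
    audio: (A, d) frame-level acoustic features

    The window centred on each word's aligned frame widens when the
    word-frame correlation is strong and narrows when it is weak --
    an assumed stand-in for SAWA's adaptive perception range.
    """
    T, d = text.shape
    A = audio.shape[0]
    scale = d ** 0.5
    fused = []
    for t in range(T):
        # Crude linear word-to-frame alignment (illustrative assumption).
        center = round(t * (A - 1) / max(T - 1, 1))
        # Correlation strength squashed to [0, 1]: stronger -> wider window.
        corr = torch.sigmoid(text[t] @ audio[center] / scale)
        half = base_half + int(max_extra * corr)
        lo, hi = max(0, center - half), min(A, center + half + 1)
        window = audio[lo:hi]                       # (W, d)
        weights = F.softmax(window @ text[t] / scale, dim=0)
        fused.append(text[t] + weights @ window)    # residual fusion
    return torch.stack(fused)                       # (T, d)


# Toy usage: 12 words fused against 200 acoustic frames.
words = torch.randn(12, 64)
frames = torch.randn(200, 64)
print(sliding_adaptive_window_attention(words, frames).shape)  # (12, 64)
```

Because each word only ever scores frames inside its own window, the attention cost stays linear in the window size rather than quadratic in the sequence lengths, which matches the memory-overhead property the abstract attributes to SAWA.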

Original language: English
Pages (from-to): 3719-3723
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
DOIs
State: Published - 2024
Externally published: Yes
Event: 25th Interspeech Conference 2024 - Kos Island, Greece
Duration: 1 Sep 2024 - 5 Sep 2024

Keywords

  • emotion refinement
  • multi-stage fusion
  • multimodal emotion recognition
  • sliding window
