Abstract
Emotion recognition in conversation should not rely solely on spotting emotion keywords; it must also make comprehensive judgments that take the context into account. To this end, we propose MFDR, which efficiently integrates acoustic and textual information. Specifically, acoustic-word combination and context perception are modeled sequentially in stages through Sliding Adaptive Window Attention (SAWA) and a Gated Context Perception Unit. More importantly, without additional memory overhead, SAWA adaptively adjusts its perception range according to correlation strength, resolving the misalignment and information loss caused by window truncation and modeling fusion at variable granularity. Furthermore, emotion refinement via Dynamic Frame Convolution strips out emotion-irrelevant frames, yielding a compact and emotionally discriminative fusion representation. The efficacy of MFDR is confirmed on IEMOCAP and CMU-MOSEI, where it demonstrates promising performance.
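The adaptive-window idea in the abstract can be sketched as follows. The paper's exact SAWA formulation is not given here, so everything in this snippet is an illustrative assumption: the naive linear audio-text alignment, the cosine-similarity criterion for widening the window, and all parameter names and values are hypothetical stand-ins for the correlation-strength mechanism the abstract describes.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors (small epsilon avoids division by zero)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def sliding_adaptive_window_attention(audio, text, base_radius=2, max_radius=4, thresh=0.5):
    """Toy sketch of attention over an adaptively sized audio window.

    For each text token, attend over audio frames centered at a naively
    aligned position; the window grows while its boundary frames remain
    strongly correlated with the query, mimicking a perception range that
    adapts to correlation strength instead of a fixed truncation.
    """
    T, d = text.shape
    A = audio.shape[0]
    out = np.zeros_like(text)
    for t in range(T):
        # Hypothetical alignment: map token position linearly onto audio frames.
        center = int(round(t * (A - 1) / max(T - 1, 1)))
        r = base_radius
        # Widen the window while the frames just outside it still correlate
        # strongly with the query token (the adaptive part).
        while r < max_radius:
            lo = max(0, center - r - 1)
            hi = min(A - 1, center + r + 1)
            if max(cosine(text[t], audio[lo]), cosine(text[t], audio[hi])) < thresh:
                break
            r += 1
        lo, hi = max(0, center - r), min(A, center + r + 1)
        # Standard scaled dot-product attention within the chosen window.
        scores = audio[lo:hi] @ text[t] / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[t] = weights @ audio[lo:hi]
    return out
```

Because the window only ever grows within a bounded radius, the per-token cost stays O(max_radius), which is consistent with the abstract's claim of no additional memory overhead relative to a fixed window.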
| Original language | English |
| --- | --- |
| Pages (from-to) | 3719-3723 |
| Number of pages | 5 |
| Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
| DOIs | |
| State | Published - 2024 |
| Externally published | Yes |
| Event | 25th Interspeech Conference 2024 - Kos Island, Greece. Duration: 1 Sep 2024 → 5 Sep 2024 |
Keywords
- emotion refinement
- multi-stages fusion
- multimodal emotion recognition
- sliding window