Attention-enhanced connectionist temporal classification for discrete speech emotion recognition

Ziping Zhao, Zhongtian Bao, Zixing Zhang, Nicholas Cummins, Haishuai Wang, Björn Schuller

Publication: Contribution to journal › Conference article › Peer-reviewed

73 Citations (Scopus)

Abstract

Discrete speech emotion recognition (SER), the assignment of a single emotion label to an entire speech utterance, is typically performed as a sequence-to-label task. This approach, however, is limited in that it can result in models that do not capture temporal changes in the speech signal, including those indicative of a particular emotion. One potential solution to overcome this limitation is to model SER as a sequence-to-sequence task instead. In this regard, we have developed an attention-based bidirectional long short-term memory (BLSTM) neural network in combination with a connectionist temporal classification (CTC) objective function (Attention-BLSTM-CTC) for SER. We also assessed the benefits of incorporating two contemporary attention mechanisms, namely component attention and quantum attention, into the CTC framework. To the best of the authors' knowledge, this is the first time that such a hybrid architecture has been employed for SER. We demonstrated the effectiveness of our approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and FAU-Aibo Emotion corpora. The experimental results show that our proposed model outperforms current state-of-the-art approaches.
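The abstract describes the hybrid architecture only at a high level. As an illustration, the following is a minimal PyTorch sketch of how a BLSTM encoder, a per-frame attention weighting, and a CTC objective can be combined for utterance-level emotion labels. Everything here is an assumption, not the paper's implementation: the attention is a simple additive per-frame scoring (not the component or quantum attention the paper evaluates), and the feature dimension (40), hidden size (128), and label scheme (four emotion classes plus a CTC blank at index 0) are placeholders.

# Hedged sketch of an Attention-BLSTM-CTC model; all sizes and the
# attention form are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn

class AttentionBLSTMCTC(nn.Module):
    def __init__(self, n_feats=40, hidden=128, n_classes=4):
        super().__init__()
        self.blstm = nn.LSTM(n_feats, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        # Additive attention: one scalar score per frame over BLSTM outputs.
        self.att_score = nn.Linear(2 * hidden, 1)
        # n_classes emotion labels plus one CTC blank symbol (index 0).
        self.classifier = nn.Linear(2 * hidden, n_classes + 1)

    def forward(self, x):
        h, _ = self.blstm(x)                          # (B, T, 2*hidden)
        w = torch.softmax(self.att_score(h), dim=1)   # (B, T, 1) frame weights
        h = h * w * h.size(1)                         # re-weight frames, keep scale
        return self.classifier(h).log_softmax(dim=-1) # (B, T, n_classes+1)

# Training step against the CTC objective: each utterance is paired with a
# target sequence, here a single emotion label per utterance (discrete SER).
model = AttentionBLSTMCTC()
ctc_loss = nn.CTCLoss(blank=0)
x = torch.randn(2, 100, 40)               # 2 utterances, 100 frames, 40 features
log_probs = model(x).transpose(0, 1)      # nn.CTCLoss expects (T, B, C)
targets = torch.tensor([1, 3])            # one emotion label per utterance
loss = ctc_loss(log_probs, targets,
                input_lengths=torch.full((2,), 100, dtype=torch.long),
                target_lengths=torch.full((2,), 1, dtype=torch.long))
loss.backward()

Treated this way, discrete SER becomes a sequence-to-sequence problem with length-one target sequences, which lets the CTC alignment decide which frames carry the emotional evidence rather than pooling the whole utterance into one vector.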

Original language: English
Pages (from-to): 206-210
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2019-September
DOIs
Publication status: Published - 2019
Externally published: Yes
Event: 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019 - Graz, Austria
Duration: 15 Sept. 2019 - 19 Sept. 2019
