Abstract
Emotion modelling in speech using deep reinforcement learning (RL) has gained attention within the speech-emotion recognition (SER) community. However, prior studies have primarily centred on recurrent neural networks (RNNs) to capture emotional context, with limited exploration of the potential offered by more recent transformer architectures. This paper presents a comprehensive evaluation of training a transformer-based model using deep RL and benchmarks its efficacy in SER. Specifically, we explore the effectiveness of a pre-trained Wav2vec2 (w2v2) model-based classifier within the deep RL setting. We evaluate the proposed deep RL framework on five publicly available datasets and benchmark the results against three recent SER studies using two deep RL methods. Based on the results, we show that the transformer-based RL agent not only improves SER accuracy but also reduces the time taken to begin emotion classification, outpacing the RNNs that have been commonly used to date. Moreover, by leveraging pre-trained transformers, we observe a reduced need for the extensive pre-training that has been the norm in prior research.
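The deep-RL framing of SER described above can be sketched in miniature: an agent observes an utterance embedding, "acts" by emitting an emotion label, and receives a reward based on correctness. The sketch below is illustrative only — it uses a linear softmax policy over synthetic features as a stand-in for the paper's pre-trained w2v2 classifier, trained with REINFORCE; the feature dimensions, class prototypes, and the +1/−1 reward scheme are all assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)
N_FEATS, N_EMOTIONS = 16, 4  # stand-ins for an embedding dim and emotion label set
# Synthetic per-emotion "prototype" vectors (assumption: utterances of one
# emotion cluster around a prototype in embedding space).
CLASS_MEANS = rng.normal(size=(N_EMOTIONS, N_FEATS))

def sample_utterance():
    """Draw a synthetic 'utterance embedding' and its true emotion label."""
    label = int(rng.integers(N_EMOTIONS))
    feats = CLASS_MEANS[label] + 0.5 * rng.normal(size=N_FEATS)
    return feats, label

def policy_probs(W, feats):
    """Softmax policy over emotion 'actions' given utterance features."""
    logits = feats @ W
    z = np.exp(logits - logits.max())
    return z / z.sum()

def greedy_accuracy(W, n=200):
    """Evaluate the policy by always taking its highest-probability action."""
    hits = 0
    for _ in range(n):
        feats, label = sample_utterance()
        hits += int(np.argmax(feats @ W) == label)
    return hits / n

def train(episodes=2000, lr=0.1):
    """REINFORCE: reward +1 for a correct emotion label, -1 otherwise."""
    W = np.zeros((N_FEATS, N_EMOTIONS))
    for _ in range(episodes):
        feats, label = sample_utterance()
        probs = policy_probs(W, feats)
        action = int(rng.choice(N_EMOTIONS, p=probs))
        reward = 1.0 if action == label else -1.0
        onehot = np.eye(N_EMOTIONS)[action]
        # Policy-gradient update: W += lr * r * grad_W log pi(action | feats)
        W += lr * reward * np.outer(feats, onehot - probs)
    return W

W = train()
print(f"greedy accuracy after RL training: {greedy_accuracy(W):.2f}")
```

In the paper's setting, the linear map `W` would be replaced by the pre-trained w2v2 classifier, so the agent starts from strong speech representations rather than from scratch — which is the intuition behind the reduced pre-training the abstract reports.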
| Original language | English |
|---|---|
| Pages (from-to) | 1600-1604 |
| Number of pages | 5 |
| Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
| DOIs | |
| State | Published - 2024 |
| Event | 25th Interspeech Conference 2024, Kos Island, Greece. Duration: 1 Sep 2024 → 5 Sep 2024 |
Keywords
- computational paralinguistics
- human-computer interaction
- reinforcement learning
- speech emotion recognition
Title: Evaluating Transformer-Enhanced Deep Reinforcement Learning for Speech Emotion Recognition