Evaluating Transformer-Enhanced Deep Reinforcement Learning for Speech Emotion Recognition

Siddique Latif, Raja Jurdak, Björn W. Schuller

Research output: Contribution to journal › Conference article › peer-review

Abstract

Emotion modelling in speech using deep reinforcement learning (RL) has gained attention within the speech-emotion recognition (SER) community. However, prior studies have primarily centred on recurrent neural networks (RNNs) to capture emotional contexts, with limited exploration of the potential offered by more recent transformer architectures. This paper presents a comprehensive evaluation of training a transformer-based model using deep RL and benchmarks its efficacy in SER. Specifically, we explore the effectiveness of a pre-trained Wav2vec2 (w2v2) model-based classifier within the deep RL setting. We evaluate the proposed deep RL framework using five publicly available datasets and benchmark the results against three recent SER studies using two deep RL methods. Based on the results, we show that the transformer-based RL agent not only improves SER accuracy but also reduces the time taken to begin emotion classification, outpacing the RNNs that have been commonly used to date. Moreover, by leveraging pre-trained transformers, we observe a reduced need for extensive pre-training, which has been the norm in prior research.
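The abstract describes, at a high level, using a pre-trained Wav2vec2-based classifier as the agent in a deep RL loop. The following is a minimal PyTorch sketch of that kind of setup; the backbone checkpoint, the four-class label set, the ±1 reward scheme, and the REINFORCE-style update are illustrative assumptions, not the authors' exact method.

    import torch
    import torch.nn as nn
    from transformers import Wav2Vec2Model

    class W2v2EmotionPolicy(nn.Module):
        """Illustrative sketch: a wav2vec2 backbone with a linear head
        producing a distribution over emotion classes, usable as an RL
        policy whose actions are emotion labels."""
        def __init__(self, num_emotions=4, backbone="facebook/wav2vec2-base"):
            super().__init__()
            self.encoder = Wav2Vec2Model.from_pretrained(backbone)
            self.head = nn.Linear(self.encoder.config.hidden_size, num_emotions)

        def forward(self, waveform):
            # waveform: (batch, samples) of raw 16 kHz audio
            hidden = self.encoder(waveform).last_hidden_state  # (B, T, H)
            pooled = hidden.mean(dim=1)                        # mean-pool over time
            return torch.distributions.Categorical(logits=self.head(pooled))

    # REINFORCE-style update with an assumed reward of +1 for a correct
    # emotion label and -1 otherwise (hypothetical reward scheme).
    policy = W2v2EmotionPolicy()
    optim = torch.optim.Adam(policy.parameters(), lr=1e-5)

    waveform = torch.randn(2, 16000)      # stand-in for 1 s of 16 kHz audio
    labels = torch.tensor([0, 2])         # stand-in emotion labels
    dist = policy(waveform)
    actions = dist.sample()
    reward = torch.where(actions == labels, 1.0, -1.0)
    loss = -(dist.log_prob(actions) * reward).mean()
    loss.backward()
    optim.step()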

Original language: English
Pages (from-to): 1600-1604
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
DOIs
State: Published - 2024
Event: 25th Interspeech Conference 2024 - Kos Island, Greece
Duration: 1 Sep 2024 – 5 Sep 2024

Keywords

  • computational paralinguistics
  • human-computer interaction
  • reinforcement learning
  • speech emotion recognition
