TY - JOUR
T1 - Evaluating Transformer-Enhanced Deep Reinforcement Learning for Speech Emotion Recognition
AU - Latif, Siddique
AU - Jurdak, Raja
AU - Schuller, Björn W.
N1 - Publisher Copyright:
© 2024 International Speech Communication Association. All rights reserved.
PY - 2024
Y1 - 2024
N2 - Emotion modelling in speech using deep reinforcement learning (RL) has gained attention within the speech emotion recognition (SER) community. However, prior studies have primarily centred on recurrent neural networks (RNNs) to capture emotional contexts, with limited exploration of the potential offered by more recent transformer architectures. This paper presents a comprehensive evaluation of training a transformer-based model with deep RL and benchmarks its efficacy in SER. Specifically, we explore the effectiveness of a pre-trained Wav2vec2 (w2v2) model-based classifier within the deep RL setting. We evaluate the proposed deep RL framework on five publicly available datasets and benchmark the results against three recent SER studies using two deep RL methods. The results show that the transformer-based RL agent not only improves SER accuracy but also reduces the time taken to begin emotion classification, outpacing the RNNs that have been commonly used to date. Moreover, by leveraging pre-trained transformers, we observe a reduced need for the extensive pre-training that has been the norm in prior research.
AB - Emotion modelling in speech using deep reinforcement learning (RL) has gained attention within the speech emotion recognition (SER) community. However, prior studies have primarily centred on recurrent neural networks (RNNs) to capture emotional contexts, with limited exploration of the potential offered by more recent transformer architectures. This paper presents a comprehensive evaluation of training a transformer-based model with deep RL and benchmarks its efficacy in SER. Specifically, we explore the effectiveness of a pre-trained Wav2vec2 (w2v2) model-based classifier within the deep RL setting. We evaluate the proposed deep RL framework on five publicly available datasets and benchmark the results against three recent SER studies using two deep RL methods. The results show that the transformer-based RL agent not only improves SER accuracy but also reduces the time taken to begin emotion classification, outpacing the RNNs that have been commonly used to date. Moreover, by leveraging pre-trained transformers, we observe a reduced need for the extensive pre-training that has been the norm in prior research.
KW - computational paralinguistics
KW - human-computer interaction
KW - reinforcement learning
KW - speech emotion recognition
UR - http://www.scopus.com/inward/record.url?scp=85214795071&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2024-1827
DO - 10.21437/Interspeech.2024-1827
M3 - Conference article
AN - SCOPUS:85214795071
SN - 2308-457X
SP - 1600
EP - 1604
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 25th Interspeech Conference 2024
Y2 - 1 September 2024 through 5 September 2024
ER -