TY - GEN
T1 - Emotion Recognition in Public Speaking Scenarios Utilising an LSTM-RNN Approach with Attention
AU - Baird, Alice
AU - Amiriparian, Shahin
AU - Milling, Manuel
AU - Schuller, Björn W.
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021/1/19
Y1 - 2021/1/19
N2 - Speaking in public can be a cause of fear for many people. Research suggests that there are physical markers, such as an increased heart rate and vocal tremolo, that indicate an individual's state of wellbeing during a public speech. In this study, we explore the advantages of speech-based features for continuous recognition of the emotional dimensions of arousal and valence during a public speaking scenario. Furthermore, we explore biological signal fusion and perform cross-language (German and English) analysis by training language-independent models and testing them on speech from various native and non-native speaker groupings. For the emotion recognition task itself, we utilise a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) architecture with a self-attention layer. When utilising audio-only features and testing with non-native speakers speaking German, we achieve at best a concordance correlation coefficient (CCC) of 0.640 and 0.491 for arousal and valence, respectively, demonstrating a strong effect for this task from non-native speakers, as well as promise for the suitability of deep learning for continuous emotion recognition in the context of public speaking.
AB - Speaking in public can be a cause of fear for many people. Research suggests that there are physical markers, such as an increased heart rate and vocal tremolo, that indicate an individual's state of wellbeing during a public speech. In this study, we explore the advantages of speech-based features for continuous recognition of the emotional dimensions of arousal and valence during a public speaking scenario. Furthermore, we explore biological signal fusion and perform cross-language (German and English) analysis by training language-independent models and testing them on speech from various native and non-native speaker groupings. For the emotion recognition task itself, we utilise a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) architecture with a self-attention layer. When utilising audio-only features and testing with non-native speakers speaking German, we achieve at best a concordance correlation coefficient (CCC) of 0.640 and 0.491 for arousal and valence, respectively, demonstrating a strong effect for this task from non-native speakers, as well as promise for the suitability of deep learning for continuous emotion recognition in the context of public speaking.
KW - affective computing
KW - long short-term memory
KW - public speaking
KW - recurrent neural networks
UR - http://www.scopus.com/inward/record.url?scp=85103959364&partnerID=8YFLogxK
U2 - 10.1109/SLT48900.2021.9383542
DO - 10.1109/SLT48900.2021.9383542
M3 - Conference contribution
AN - SCOPUS:85103959364
T3 - 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings
SP - 397
EP - 402
BT - 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 IEEE Spoken Language Technology Workshop, SLT 2021
Y2 - 19 January 2021 through 22 January 2021
ER -