TY - GEN
T1 - Emotion Recognition from Speech Signals by Mel-Spectrogram and a CNN-RNN
AU - Sharan, Roneel V.
AU - Mascolo, Cecilia
AU - Schuller, Björn W.
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Speech emotion recognition (SER) in health applications can offer several benefits by providing insights into the emotional well-being of individuals. In this work, we propose a method for SER using a time-frequency representation of the speech signal and neural networks. In particular, we divide the speech signals into overlapping segments and transform each segment into a Mel-spectrogram. The Mel-spectrogram forms the input to YAMNet, a pretrained convolutional neural network for audio classification, which learns the spectral characteristics within each Mel-spectrogram. In addition, we utilize a long short-term memory network, a type of recurrent neural network, to learn the temporal dependencies across the sequence of Mel-spectrograms in each speech signal. The proposed method is evaluated on the angry, happy, and sad emotion types, and the neutral expression, on two SER datasets, achieving average accuracies of 0.711 and 0.780, respectively. These results are a relative improvement over baseline methods and demonstrate the potential of our method for detecting emotional states from speech signals.
AB - Speech emotion recognition (SER) in health applications can offer several benefits by providing insights into the emotional well-being of individuals. In this work, we propose a method for SER using a time-frequency representation of the speech signal and neural networks. In particular, we divide the speech signals into overlapping segments and transform each segment into a Mel-spectrogram. The Mel-spectrogram forms the input to YAMNet, a pretrained convolutional neural network for audio classification, which learns the spectral characteristics within each Mel-spectrogram. In addition, we utilize a long short-term memory network, a type of recurrent neural network, to learn the temporal dependencies across the sequence of Mel-spectrograms in each speech signal. The proposed method is evaluated on the angry, happy, and sad emotion types, and the neutral expression, on two SER datasets, achieving average accuracies of 0.711 and 0.780, respectively. These results are a relative improvement over baseline methods and demonstrate the potential of our method for detecting emotional states from speech signals.
KW - convolutional neural network
KW - Mel-spectrogram
KW - recurrent neural network
KW - speech emotion recognition
UR - http://www.scopus.com/inward/record.url?scp=85214986128&partnerID=8YFLogxK
U2 - 10.1109/EMBC53108.2024.10782952
DO - 10.1109/EMBC53108.2024.10782952
M3 - Conference contribution
AN - SCOPUS:85214986128
T3 - Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS
BT - 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBC 2024 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBC 2024
Y2 - 15 July 2024 through 19 July 2024
ER -