TY - GEN
T1 - An image-based deep spectrum feature representation for the recognition of emotional speech
AU - Cummins, Nicholas
AU - Amiriparian, Shahin
AU - Hagerer, Gerhard
AU - Batliner, Anton
AU - Steidl, Stefan
AU - Schuller, Björn W.
N1 - Publisher Copyright:
© 2017 ACM.
PY - 2017/10/23
Y1 - 2017/10/23
N2 - The outputs of the higher layers of deep pre-trained convolutional neural networks (CNNs) have consistently been shown to provide a rich representation of an image for use in recognition tasks. This study explores the suitability of such an approach for speech-based emotion recognition tasks. First, we detail a new acoustic feature representation, denoted as deep spectrum features, derived from feeding spectrograms through a very deep image classification CNN and forming a feature vector from the activations of the last fully connected layer. We then compare the performance of our novel features with standardised brute-force and bag-of-audio-words (BoAW) acoustic feature representations for 2- and 5-class speech-based emotion recognition in clean, noisy and denoised conditions. The presented results show that image-based approaches are a promising avenue of research for speech-based recognition tasks. Key results indicate that deep spectrum features are comparable in performance with the other tested acoustic feature representations under matched-noise-type train-test conditions; however, the BoAW paradigm is better suited to cross-noise-type train-test conditions.
AB - The outputs of the higher layers of deep pre-trained convolutional neural networks (CNNs) have consistently been shown to provide a rich representation of an image for use in recognition tasks. This study explores the suitability of such an approach for speech-based emotion recognition tasks. First, we detail a new acoustic feature representation, denoted as deep spectrum features, derived from feeding spectrograms through a very deep image classification CNN and forming a feature vector from the activations of the last fully connected layer. We then compare the performance of our novel features with standardised brute-force and bag-of-audio-words (BoAW) acoustic feature representations for 2- and 5-class speech-based emotion recognition in clean, noisy and denoised conditions. The presented results show that image-based approaches are a promising avenue of research for speech-based recognition tasks. Key results indicate that deep spectrum features are comparable in performance with the other tested acoustic feature representations under matched-noise-type train-test conditions; however, the BoAW paradigm is better suited to cross-noise-type train-test conditions.
KW - Computational paralinguistics
KW - Convolutional neural networks
KW - Emotions
KW - Image recognition
KW - Realism
KW - Spectral features
UR - http://www.scopus.com/inward/record.url?scp=85035242905&partnerID=8YFLogxK
U2 - 10.1145/3123266.3123371
DO - 10.1145/3123266.3123371
M3 - Conference contribution
AN - SCOPUS:85035242905
T3 - MM 2017 - Proceedings of the 2017 ACM Multimedia Conference
SP - 478
EP - 484
BT - MM 2017 - Proceedings of the 2017 ACM Multimedia Conference
PB - Association for Computing Machinery, Inc
T2 - 25th ACM International Conference on Multimedia, MM 2017
Y2 - 23 October 2017 through 27 October 2017
ER -