An image-based deep spectrum feature representation for the recognition of emotional speech

Nicholas Cummins, Shahin Amiriparian, Gerhard Hagerer, Anton Batliner, Stefan Steidl, Björn W. Schuller

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

133 Scopus citations

Abstract

The outputs of the higher layers of deep pre-trained convolutional neural networks (CNNs) have consistently been shown to provide a rich representation of an image for use in recognition tasks. This study explores the suitability of such an approach for speech-based emotion recognition tasks. First, we detail a new acoustic feature representation, denoted as deep spectrum features, derived from feeding spectrograms through a very deep image classification CNN and forming a feature vector from the activations of the last fully connected layer. We then compare the performance of our novel features with standardised brute-force and bag-of-audio-words (BoAW) acoustic feature representations for 2- and 5-class speech-based emotion recognition in clean, noisy and denoised conditions. The presented results show that image-based approaches are a promising avenue of research for speech-based recognition tasks. Key results indicate that deep-spectrum features are comparable in performance with the other tested acoustic feature representations in matched for noise type train-test conditions; however, the BoAW paradigm is better suited to cross-noise-type train-test conditions.

Original languageEnglish
Title of host publicationMM 2017 - Proceedings of the 2017 ACM Multimedia Conference
PublisherAssociation for Computing Machinery, Inc
Pages478-484
Number of pages7
ISBN (Electronic)9781450349062
DOIs
StatePublished - 23 Oct 2017
Externally publishedYes
Event25th ACM International Conference on Multimedia, MM 2017 - Mountain View, United States
Duration: 23 Oct 201727 Oct 2017

Publication series

NameMM 2017 - Proceedings of the 2017 ACM Multimedia Conference

Conference

Conference25th ACM International Conference on Multimedia, MM 2017
Country/TerritoryUnited States
CityMountain View
Period23/10/1727/10/17

Keywords

  • Computational paralinguistics
  • Convolutional neural networks
  • Emotions
  • Image recognition
  • Realism
  • Spectral features

Fingerprint

Dive into the research topics of 'An image-based deep spectrum feature representation for the recognition of emotional speech'. Together they form a unique fingerprint.

Cite this