Representation transfer learning from deep end-to-end speech recognition networks for the classification of health states from speech

Benjamin Sertolli, Zhao Ren, Björn W. Schuller, Nicholas Cummins

Research output: Contribution to journal › Article › peer-review

21 Scopus citations

Abstract

Representation transfer learning has been widely used across a range of machine learning tasks. One notable approach in the speech literature is the use of Convolutional Neural Networks, pre-trained for image classification tasks, to extract features from spectrograms of speech signals. Interestingly, despite the strong performance of such approaches, few research efforts have explored the suitability of speech-specific networks for feature extraction. In this regard, a novel feature representation learning framework is presented herein. The approach comprises the use of Automatic Speech Recognition (ASR) deep neural networks as feature extractors, the fusion of several extracted feature representations using Compact Bilinear Pooling (CBP), and finally inference via a specially optimised Recurrent Neural Network (RNN) classifier. To determine the usefulness of these feature representations, they are comprehensively tested on two representative speech-health classification tasks, namely recognising the type of food being eaten and detecting speaker intoxication. Key results indicate the promise of the extracted features, which achieve results comparable to other state-of-the-art approaches in the literature.
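The fusion step named in the abstract, Compact Bilinear Pooling, can be illustrated with a minimal NumPy sketch. It assumes the commonly described Tensor Sketch formulation (a count sketch of each vector, then an element-wise product in the Fourier domain) and is not the authors' implementation. The function names, feature dimensions, and output size below are hypothetical, chosen only for illustration.

```python
import numpy as np


def make_sketch_params(input_dim, output_dim, rng):
    """Draw the random hash indices and signs used by a count sketch."""
    h = rng.integers(0, output_dim, size=input_dim)        # bucket per input dimension
    s = rng.choice(np.array([-1.0, 1.0]), size=input_dim)  # random sign per input dimension
    return h, s


def count_sketch(x, h, s, output_dim):
    """Project x into output_dim buckets by adding signed entries."""
    sketch = np.zeros(output_dim)
    np.add.at(sketch, h, s * x)  # unbuffered add, so repeated bucket indices accumulate
    return sketch


def compact_bilinear_pool(x, y, params_x, params_y, output_dim):
    """Fuse two feature vectors via circular convolution of their count sketches."""
    sx = count_sketch(x, *params_x, output_dim)
    sy = count_sketch(y, *params_y, output_dim)
    # Element-wise product in the Fourier domain equals circular convolution,
    # which in turn equals a count sketch of the outer product of x and y.
    return np.fft.irfft(np.fft.rfft(sx) * np.fft.rfft(sy), n=output_dim)


# Hypothetical example: fuse two feature vectors of different sizes.
rng = np.random.default_rng(0)
feat_a = rng.standard_normal(8)   # stand-in for one extracted representation
feat_b = rng.standard_normal(6)   # stand-in for a second representation
out_dim = 16
params_a = make_sketch_params(feat_a.size, out_dim, rng)
params_b = make_sketch_params(feat_b.size, out_dim, rng)
fused = compact_bilinear_pool(feat_a, feat_b, params_a, params_b, out_dim)
```

The fused vector has a fixed length (`out_dim`) regardless of the input sizes, so it approximates the full outer product of the two representations without ever materialising it. In a pipeline like the one described, such a vector would then be passed to the classifier.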

Original language: English
Article number: 101204
Journal: Computer Speech and Language
Volume: 68
State: Published - Jul 2021
Externally published: Yes

Keywords

  • Automatic speech recognition
  • Compact bilinear pooling
  • Computational paralinguistics
  • Recurrent neural networks
  • Representation learning
  • Transfer learning
