TY - JOUR
T1 - Representation transfer learning from deep end-to-end speech recognition networks for the classification of health states from speech
AU - Sertolli, Benjamin
AU - Ren, Zhao
AU - Schuller, Björn W.
AU - Cummins, Nicholas
N1 - Publisher Copyright:
© 2021
PY - 2021/7
Y1 - 2021/7
N2 - Representation transfer learning has been widely used across a range of machine learning tasks. One such notable approach seen in the speech literature is the use of Convolutional Neural Networks, pre-trained for image classification tasks, to extract features from spectrograms of speech signals. Interestingly, despite the strong performance of such approaches, there have been minimal research efforts exploring the suitability of using speech-specific networks to perform feature extraction. In this regard, a novel feature representation learning framework is presented herein. This approach comprises the use of Automatic Speech Recognition (ASR) deep neural networks as feature extractors, the fusion of several extracted feature representations using Compact Bilinear Pooling (CBP), and finally inference via a specially optimised Recurrent Neural Network (RNN) classifier. To determine the usefulness of these feature representations, they are comprehensively tested on two representative speech-health classification tasks, namely identifying the type of food being eaten and detecting speaker intoxication. Key results indicate the promise of the extracted features, demonstrating results comparable to other state-of-the-art approaches in the literature.
AB - Representation transfer learning has been widely used across a range of machine learning tasks. One such notable approach seen in the speech literature is the use of Convolutional Neural Networks, pre-trained for image classification tasks, to extract features from spectrograms of speech signals. Interestingly, despite the strong performance of such approaches, there have been minimal research efforts exploring the suitability of using speech-specific networks to perform feature extraction. In this regard, a novel feature representation learning framework is presented herein. This approach comprises the use of Automatic Speech Recognition (ASR) deep neural networks as feature extractors, the fusion of several extracted feature representations using Compact Bilinear Pooling (CBP), and finally inference via a specially optimised Recurrent Neural Network (RNN) classifier. To determine the usefulness of these feature representations, they are comprehensively tested on two representative speech-health classification tasks, namely identifying the type of food being eaten and detecting speaker intoxication. Key results indicate the promise of the extracted features, demonstrating results comparable to other state-of-the-art approaches in the literature.
KW - Automatic speech recognition
KW - Compact bilinear pooling
KW - Computational paralinguistics
KW - Recurrent neural networks
KW - Representation learning
KW - Transfer learning
UR - http://www.scopus.com/inward/record.url?scp=85101623176&partnerID=8YFLogxK
U2 - 10.1016/j.csl.2021.101204
DO - 10.1016/j.csl.2021.101204
M3 - Article
AN - SCOPUS:85101623176
SN - 0885-2308
VL - 68
JO - Computer Speech and Language
JF - Computer Speech and Language
M1 - 101204
ER -