Deep unsupervised representation learning for audio-based medical applications

Shahin Amiriparian, Maximilian Schmitt, Sandra Ottl, Maurice Gerczuk, Björn Schuller

Research output: Chapter in Book/Report/Conference proceeding › Chapter › peer-review


Abstract

Feature learning denotes a set of approaches for transforming raw input data into representations that can be effectively utilised in solving machine learning problems. Classifiers and regressors require training data in a form that is computationally tractable to process. Real-world data, however, is remarkably variable and complex in nature, e.g., an audio recording of a group of people talking in a park whilst a dog barks and a musician plays the guitar in the background, or health-related data such as coughing and sneezing recorded by consumer smartphones. For understanding such data, developing expert-designed, hand-crafted features often demands an exhaustive amount of time and resources. A further disadvantage of such features is their lack of generalisation, i.e., new features must be re-engineered for each new task. Automatic representation learning methods are therefore indispensable. In this chapter, we first discuss the preliminaries of contemporary representation learning techniques for computer audition tasks, differentiating between approaches based on Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). We then introduce and evaluate three state-of-the-art deep learning systems for unsupervised representation learning from raw audio: (1) pre-trained image classification CNNs, (2) a deep convolutional generative adversarial network (DCGAN), and (3) a recurrent sequence-to-sequence autoencoder (S2SAE). For each of these algorithms, the representations are obtained from spectrograms of the input audio data. Finally, for a range of audio-based machine learning tasks, including abnormal heart sound classification, snore sound classification, and bipolar disorder recognition, we evaluate the efficacy of the deep representations, which are: (i) the activations of the fully connected layers of the pre-trained CNNs, (ii) the activations of the discriminator in the case of the DCGAN, and (iii) the activations of a fully connected layer between the encoder and decoder units in the case of the S2SAE.
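As a concrete illustration of approach (1), the following is a minimal sketch, not the authors' released tooling: it renders a mel spectrogram as a three-channel image and taps a fully connected layer of an ImageNet pre-trained CNN as the learnt representation, as the abstract describes. The model choice (VGG16), the spectrogram parameters, the tapped layer, and the input file `cough.wav` are all illustrative assumptions.

```python
# Sketch of approach (1): deep representations from spectrograms via a
# pre-trained image-classification CNN. All parameter choices are assumptions.
import numpy as np
import librosa
import torch
import torchvision.models as models
import torchvision.transforms as T

def spectrogram_image(wav_path, sr=16000):
    """Render a mel spectrogram as a 3-channel 224x224 image tensor."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # Normalise to [0, 1] and replicate across the three colour channels,
    # since ImageNet CNNs expect RGB input.
    img = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
    img = torch.tensor(img, dtype=torch.float32).unsqueeze(0).repeat(3, 1, 1)
    return T.Resize((224, 224), antialias=True)(img)

# Pre-trained CNN; we tap the activations of a fully connected layer
# as the representation (here the second 4096-d layer of VGG16).
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
feature_extractor = torch.nn.Sequential(
    vgg.features, vgg.avgpool, torch.nn.Flatten(),
    *list(vgg.classifier.children())[:4],  # up to and including fc2
).eval()

with torch.no_grad():
    x = spectrogram_image("cough.wav").unsqueeze(0)  # hypothetical file
    representation = feature_extractor(x)            # shape: (1, 4096)
```

The resulting vector can be fed to a conventional classifier (e.g., a linear SVM) for tasks such as abnormal heart sound or snore sound classification; for the DCGAN and S2SAE systems, the representation would instead be taken from the discriminator activations and the encoder-decoder bottleneck layer, respectively.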

Original language: English
Title of host publication: Intelligent Systems Reference Library
Publisher: Springer
Pages: 137-164
Number of pages: 28
DOIs
State: Published - 2020
Externally published: Yes

Publication series

Name: Intelligent Systems Reference Library
Volume: 186
ISSN (Print): 1868-4394
ISSN (Electronic): 1868-4408
