Automated classification of children's linguistic versus non-linguistic vocalisations

Zixing Zhang, Alejandrina Cristia, Anne S. Warlaumont, Björn Schuller

Research output: Contribution to journal, conference article, peer-reviewed

1 Scopus citation

Abstract

A key outstanding task for speech technology involves dealing with non-standard speakers, notably young children. Distinguishing children's linguistic from non-linguistic vocalisations is crucial for a number of applied and fundamental research goals, yet few systems are available for such classification. This paper investigates three approaches: (i) two large-scale frame-level acoustic feature sets (eGeMAPS and ComParE16) followed by a dynamic model (GRU-RNN); (ii) two kinds of derived static feature sets on the segment level (functional-based and Bag of Audio Words) combined with a static model (SVM); and (iii) representations learnt automatically from the raw voice signals by an end-to-end system. These are applied to a large database of children's vocalisations (total N = 6,298) drawn from daylong recordings gathered in Namibia, Bolivia, and Vanuatu. Among these systems, the GRU-RNN using ComParE16 features empirically performs best. We also identify promising paths for further research, including applying a finer-grained classification of children's vocalisations to these data and exploring other feature systems.
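The segment-level representations named in the abstract (functional-based features and Bag of Audio Words) both turn a variable-length sequence of frame-level features into one fixed-length vector that a static classifier such as an SVM can consume. The following is a minimal sketch of that idea only, not the authors' implementation: the two-dimensional frames, the two-entry codebook, and the function names `functionals` and `bag_of_audio_words` are illustrative assumptions, and the feature sets used in the paper (eGeMAPS, ComParE16) are far larger.

```python
import math
from typing import List, Sequence

# One frame-level feature vector (hypothetical low-level descriptors).
Frame = Sequence[float]


def functionals(frames: List[Frame]) -> List[float]:
    """Summarise a non-empty segment as per-dimension mean, std, min, max."""
    n = len(frames)
    out: List[float] = []
    for dim in zip(*frames):  # iterate over feature dimensions
        mean = sum(dim) / n
        var = sum((x - mean) ** 2 for x in dim) / n  # population variance
        out.extend([mean, math.sqrt(var), min(dim), max(dim)])
    return out


def bag_of_audio_words(frames: List[Frame], codebook: List[Frame]) -> List[float]:
    """Assign each frame to its nearest codeword (squared Euclidean distance)
    and return the normalised histogram of codeword counts."""
    counts = [0] * len(codebook)
    for frame in frames:
        dists = [sum((a - b) ** 2 for a, b in zip(frame, word)) for word in codebook]
        counts[dists.index(min(dists))] += 1
    total = len(frames)
    return [c / total for c in counts]


# Toy data (assumed for illustration): three frames, two feature dimensions.
segment = [(0.0, 1.0), (0.2, 0.8), (1.0, 0.0)]
codebook = [(0.0, 1.0), (1.0, 0.0)]  # in practice learnt, e.g. by clustering

seg_functionals = functionals(segment)            # length 4 * 2 = 8
seg_boaw = bag_of_audio_words(segment, codebook)  # length len(codebook) = 2
```

Either vector has the same length for every segment regardless of its duration, which is what allows a static model to be trained on segments of differing lengths.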

Original language: English
Pages (from-to): 2588-2592
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2018-September
State: Published - 2018
Externally published: Yes
Event: 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018 - Hyderabad, India
Duration: 2 Sep 2018 to 6 Sep 2018

Keywords

  • Babbling
  • Bag of audio words
  • Crying
  • End-to-end
  • Infancy
  • Language acquisition
  • Large-scale feature set
  • Linguistic vocalisations
