Abstract
A key outstanding task for speech technology is dealing with non-standard speakers, notably young children. Distinguishing children's linguistic from non-linguistic vocalisations is crucial for a number of applied and fundamental research goals, yet few systems are available for this classification. This paper investigates three approaches: two large-scale frame-level acoustic feature sets (eGeMAPS and ComParE16) fed to a dynamic model (GRU-RNN); two kinds of static, segment-level feature sets derived from them (functionals and Bag of Audio Words) combined with a static model (SVM); and representations learnt automatically from the raw voice signal by an end-to-end system. These are applied to a large database of children's vocalisations (total N = 6,298) drawn from daylong recordings gathered in Namibia, Bolivia, and Vanuatu. Among these systems, the GRU-RNN using ComParE16 features empirically performs best. We further identify promising directions for future research, including applying a finer-grained classification of children's vocalisations to these data and exploring other feature sets.
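The best-performing configuration described above pairs ComParE16 frame-level features with a GRU-RNN that assigns one label per segment. As a rough illustration only, and not the authors' implementation, the sketch below shows a minimal GRU sequence classifier in PyTorch; the 65-dimensional frame size, hidden size, and binary linguistic/non-linguistic output are illustrative assumptions rather than values taken from the paper.

```python
import torch
import torch.nn as nn

class VocalisationGRU(nn.Module):
    """Toy GRU classifier over frame-level acoustic features (illustrative only)."""
    def __init__(self, n_features=65, hidden_size=64, n_classes=2):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, n_classes)

    def forward(self, x):
        # x: (batch, n_frames, n_features) sequence of frame-level descriptors
        _, h = self.gru(x)        # h: (num_layers, batch, hidden_size)
        return self.out(h[-1])    # one class-score vector per segment

model = VocalisationGRU()
dummy = torch.randn(8, 200, 65)   # e.g. 8 segments of 200 feature frames each
logits = model(dummy)             # shape (8, 2): linguistic vs. non-linguistic
```

A segment-level alternative, as in the abstract's functional-based and Bag of Audio Words variants, would instead pool the frame features into a single fixed-length vector per segment and pass it to a static classifier such as an SVM.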
Original language | English |
---|---|
Pages (from-to) | 2588-2592 |
Number of pages | 5 |
Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
Volume | 2018-September |
DOIs | |
State | Published - 2018 |
Externally published | Yes |
Event | 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018 - Hyderabad, India. Duration: 2 Sep 2018 → 6 Sep 2018 |
Keywords
- Babbling
- Bag of audio words
- Crying
- End-to-end
- Infancy
- Language acquisition
- Large-scale feature set
- Linguistic vocalisations