Three recent trends in Paralinguistics on the way to omniscient machine intelligence

Björn W. Schuller, Yue Zhang, Felix Weninger

Research output: Contribution to journal, Article, peer-reviewed

7 Scopus citations

Abstract

A two-year-old has heard approximately 1000 h of speech; by the age of ten, around 10,000 h. Automatic speech recognisers are often trained on data of similar magnitude. In stark contrast, only a few databases for training speaker analysis systems contain more than 10 h of speech, and hardly any contain more than 100 h. Yet these systems are ideally expected to recognise the states and traits of speakers independently of the person, spoken content, language, cultural background, and acoustic disturbances, preferably at human parity or even at superhuman levels. While this has not yet been reached for many tasks, such as speaker emotion recognition, deep learning, which is often described as leading to significant improvements, holds the promise of reaching this goal when combined with sufficient learning data. Luckily, every second, more than 5 h of video are uploaded to the web, and several hundred hours of audio and video communication take place in most languages of the world. A major effort could thus be invested in the efficient labelling and sharing of these data. In this contribution, benchmarks are first given from the nine research challenges co-organised by the authors at the annual Interspeech conference since 2009. Then, approaches to the most efficient possible exploitation of the available ‘big’ (unlabelled) data are presented. Small-world modelling in combination with unsupervised learning helps to rapidly identify potential target data of interest. Further, gamified crowdsourcing combined with human-machine cooperative learning turns the annotation process into an entertaining experience while reducing the manual labelling effort to a minimum. Moreover, increasingly autonomous deep holistic end-to-end learning solutions are presented for the tasks at hand. The concluding discussion contains some crystal-ball gazing alongside practical hints, without neglecting ethical aspects.

Original language: English
Pages (from-to): 273-283
Number of pages: 11
Journal: Journal on Multimodal User Interfaces
Volume: 12
Issue number: 4
State: Published - 1 Dec 2018
Externally published: Yes

Keywords

  • Automatic speaker analysis
  • Big data
  • Computational Paralinguistics
  • Deep learning
