TY - GEN
T1 - Automatic speaker analysis 2.0
T2 - 9th International Conference on Speech Technology and Human-Computer Dialogue, SpeD 2017
AU - Schuller, Björn W.
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2017/7/24
Y1 - 2017/7/24
N2 - Automatic Speaker Analysis has largely focused on single aspects of a speaker, such as her ID, gender, emotion, personality, or health state. This broadly ignores the interdependency of all the different states and traits acting on the one single voice production mechanism available to a human speaker. In other words, we may sometimes sound depressed when we simply have the flu and hardly find the energy to put more vocal effort into our articulation and sound production. Recently, this gap has given rise to an increasingly holistic speaker analysis that assesses the 'larger picture' in one pass, for instance by multi-target learning. However, robust assessment requires large amounts of speech and language resources labelled in rich ways to train such interdependencies, as well as architectures able to cope with multi-target learning on massive amounts of speech data. In this light, this contribution discusses efficient mechanisms such as large-scale social-media pre-scanning with dynamic cooperative crowd-sourcing for rapid data collection, cross-task labelling of these data across a wider range of attributes to reach 'big & rich' speech data, and efficient multi-target end-to-end and end-to-evolution deep learning paradigms to learn an accordingly rich representation of diverse target tasks. The ultimate goal is to enable machines to hear the 'entire' person, her condition, and her whereabouts behind the voice and words, rather than aiming at a single aspect while remaining blind to the overall individual and her state, thus leading to the next level of Automatic Speaker Analysis.
UR - http://www.scopus.com/inward/record.url?scp=85034221067&partnerID=8YFLogxK
U2 - 10.1109/SPED.2017.7990449
DO - 10.1109/SPED.2017.7990449
M3 - Conference contribution
AN - SCOPUS:85034221067
T3 - 2017 9th International Conference on Speech Technology and Human-Computer Dialogue, SpeD 2017
BT - 2017 9th International Conference on Speech Technology and Human-Computer Dialogue, SpeD 2017
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 6 July 2017 through 9 July 2017
ER -