TY - JOUR
T1 - Computational Assessment of Interest in Speech—Facing the Real-Life Challenge
AU - Wöllmer, Martin
AU - Weninger, Felix
AU - Eyben, Florian
AU - Schuller, Björn
N1 - Publisher Copyright: © 2011, Springer-Verlag.
PY - 2011/8/1
Y1 - 2011/8/1
N2 - Automatic detection of a speaker’s level of interest is of high relevance for many applications, such as automatic customer care, tutoring systems, or affective agents. However, as the latest Interspeech 2010 Paralinguistic Challenge has shown, reliable estimation of non-prototypical natural interest in spontaneous conversations independent of the subject remains a challenge. In this article, we introduce a fully automatic combination of brute-forced acoustic features, linguistic analysis, and non-linguistic vocalizations, exploiting cross-entity information in an early feature fusion. Linguistic information is based on speech recognition by a multi-stream approach fusing context-sensitive phoneme predictions and standard acoustic features. We provide subject-independent results for interest assessment using Bidirectional Long Short-Term Memory networks on the official Challenge task and show that our proposed system leads to the best recognition accuracies that have ever been reported for this task. The corresponding TUM AVIC corpus consists of highly spontaneous speech from face-to-face commercial presentations. The techniques presented in this article are also used in the SEMAINE system, which features an emotion-sensitive embodied conversational agent.
KW - Affective computing
KW - Interest recognition
KW - Long short-term memory
KW - Recurrent neural networks
UR - http://www.scopus.com/inward/record.url?scp=85087946740&partnerID=8YFLogxK
U2 - 10.1007/s13218-011-0108-9
DO - 10.1007/s13218-011-0108-9
M3 - Article
AN - SCOPUS:85087946740
SN - 0933-1875
VL - 25
SP - 225
EP - 234
JO - KI - Künstliche Intelligenz
JF - KI - Künstliche Intelligenz
IS - 3
M1 - 225
ER -