TY - GEN
T1 - Audiovisual recognition of spontaneous interest within conversations
AU - Schuller, Björn
AU - Höthker, Anja
AU - Müller, Ronald
AU - Konosu, Hitoshi
AU - Hörnler, Benedikt
AU - Rigoll, Gerhard
PY - 2007
Y1 - 2007
N2 - In this work we present an audiovisual approach to the recognition of spontaneous interest in human conversations. For a maximally robust estimate, information from four sources is combined by a synergistic fusion that is tolerant to the failure of individual streams. First, speech is analyzed with respect to acoustic properties, based on a high-dimensional prosodic, articulatory, and voice-quality feature space, and with respect to spoken content, via LVCSR and bag-of-words vector-space modeling including non-verbals. Second, visual analysis provides patterns of facial expression by AAMs and of movement activity by eye tracking. Experiments are based on a database of 10.5 h of spontaneous human-to-human conversation comprising 20 subjects balanced in gender and age class. Recordings were made with a room microphone, a camera, and close-talk headsets to cover diverse comfort and noise conditions. Three levels of interest were annotated within a rich transcription. We describe each information stream and the early-level fusion in detail. Our experiments target a person-independent system for real-life use and show the high potential of such a multimodal approach. Benchmark results based on manual transcription versus fully automatic processing are also provided.
KW - Affective computing
KW - Audiovisual
KW - Emotion
KW - Interest
UR - http://www.scopus.com/inward/record.url?scp=48249092791&partnerID=8YFLogxK
U2 - 10.1145/1322192.1322201
DO - 10.1145/1322192.1322201
M3 - Conference contribution
AN - SCOPUS:48249092791
SN - 9781595938176
T3 - Proceedings of the 9th International Conference on Multimodal Interfaces, ICMI'07
SP - 30
EP - 37
BT - Proceedings of the 9th International Conference on Multimodal Interfaces, ICMI'07
T2 - 9th International Conference on Multimodal Interfaces, ICMI 2007
Y2 - 12 November 2007 through 15 November 2007
ER -