TY - JOUR
T1 - Combining long short-term memory and dynamic Bayesian networks for incremental emotion-sensitive artificial listening
AU - Wöllmer, Martin
AU - Schuller, Björn
AU - Eyben, Florian
AU - Rigoll, Gerhard
PY - 2010/10
Y1 - 2010/10
N2 - The automatic estimation of human affect from the speech signal is an important step towards making virtual agents more natural and human-like. In this paper, we present a novel technique for incremental recognition of the user's emotional state as it is applied in a sensitive artificial listener (SAL) system designed for socially competent human-machine communication. Our method is capable of using acoustic, linguistic, as well as long-range contextual information in order to continuously predict the current quadrant in a two-dimensional emotional space spanned by the dimensions valence and activation. The main system components are a hierarchical dynamic Bayesian network (DBN) for detecting linguistic keyword features and long short-term memory (LSTM) recurrent neural networks which model phoneme context and emotional history to predict the affective state of the user. Experimental evaluations on the SAL corpus of non-prototypical real-life emotional speech data consider a number of variants of our recognition framework: continuous emotion estimation from low-level feature frames is evaluated as a new alternative to the common approach of computing statistical functionals of given speech turns. Further performance gains are achieved by discriminatively training LSTM networks and by using bidirectional context information, leading to a quadrant prediction F1-measure of up to 51.3 %, which is only 7.6 % below the average inter-labeler consistency.
AB - The automatic estimation of human affect from the speech signal is an important step towards making virtual agents more natural and human-like. In this paper, we present a novel technique for incremental recognition of the user's emotional state as it is applied in a sensitive artificial listener (SAL) system designed for socially competent human-machine communication. Our method is capable of using acoustic, linguistic, as well as long-range contextual information in order to continuously predict the current quadrant in a two-dimensional emotional space spanned by the dimensions valence and activation. The main system components are a hierarchical dynamic Bayesian network (DBN) for detecting linguistic keyword features and long short-term memory (LSTM) recurrent neural networks which model phoneme context and emotional history to predict the affective state of the user. Experimental evaluations on the SAL corpus of non-prototypical real-life emotional speech data consider a number of variants of our recognition framework: continuous emotion estimation from low-level feature frames is evaluated as a new alternative to the common approach of computing statistical functionals of given speech turns. Further performance gains are achieved by discriminatively training LSTM networks and by using bidirectional context information, leading to a quadrant prediction F1-measure of up to 51.3 %, which is only 7.6 % below the average inter-labeler consistency.
KW - Dynamic Bayesian networks (DBNs)
KW - emotion recognition
KW - intelligent environments
KW - long short-term memory (LSTM)
KW - recurrent neural nets
KW - virtual agents
UR - http://www.scopus.com/inward/record.url?scp=77956721304&partnerID=8YFLogxK
U2 - 10.1109/JSTSP.2010.2057200
DO - 10.1109/JSTSP.2010.2057200
M3 - Article
AN - SCOPUS:77956721304
SN - 1932-4553
VL - 4
SP - 867
EP - 881
JO - IEEE Journal of Selected Topics in Signal Processing
JF - IEEE Journal of Selected Topics in Signal Processing
IS - 5
M1 - 5508344
ER -