Abstract
Continuous dimensional emotion recognition from audio is a sequential regression problem, where the goal is to maximize the correlation between sequences of regression outputs and continuous-valued emotion contours, while minimizing the average deviation. As in other domains, deep neural networks trained on simple acoustic features achieve good performance on this task. Yet, the usual squared-error objective functions for neural network training do not fully reflect this goal. Hence, in this paper we introduce a technique for the discriminative training of deep neural networks using the concordance correlation coefficient as the cost function, which unites both correlation and mean squared error in a single differentiable function. Results on the MediaEval 2013 and AV+EC 2015 Challenge data sets show that the proposed method can significantly improve the evaluation criteria compared to standard mean squared error training, in both the music and speech domains.
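To illustrate the idea behind such a cost function, the sketch below computes a concordance-correlation-coefficient (CCC) loss in plain NumPy. This is not the authors' implementation; the function name `ccc_loss` and the toy data are illustrative, and in practice the same expression would be written in an autodiff framework so that it can be minimized directly during network training.

```python
import numpy as np

def ccc_loss(pred, gold):
    """1 - CCC between a predicted and a gold-standard sequence.

    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2),
    so it rewards high correlation and penalizes offsets in mean and scale.
    """
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    mu_p, mu_g = pred.mean(), gold.mean()
    var_p, var_g = pred.var(), gold.var()            # biased (1/N) variances
    cov = ((pred - mu_p) * (gold - mu_g)).mean()     # biased covariance
    ccc = 2.0 * cov / (var_p + var_g + (mu_p - mu_g) ** 2)
    return 1.0 - ccc                                 # minimizing this maximizes CCC

# Toy example: predictions that correlate with the target but are scaled and offset
gold = np.sin(np.linspace(0, 6, 200))
pred = 0.8 * gold + 0.2
print(ccc_loss(pred, gold))  # > 0, since the offset and scaling reduce CCC
```

Because both the correlation and the mean/variance mismatch enter a single differentiable expression, gradients of this loss push the network toward outputs that track the emotion contour in shape as well as in level.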
| Original language | English |
| --- | --- |
| Pages (from-to) | 2196-2202 |
| Number of pages | 7 |
| Journal | IJCAI International Joint Conference on Artificial Intelligence |
| Volume | 2016-January |
| State | Published - 2016 |
| Externally published | Yes |
| Event | 25th International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, United States, 9 Jul 2016 – 15 Jul 2016 |