TY - JOUR
T1 - Self-attention transfer networks for speech emotion recognition
AU - Zhao, Ziping
AU - Bao, Zhongtian
AU - Zhang, Zixing
AU - Cummins, Nicholas
AU - Sun, Shihuang
AU - Wang, Haishuai
AU - Tao, Jianhua
AU - Schuller, Björn W.
N1 - Publisher Copyright:
© 2019 Beijing Zhongke Journal Publishing Co. Ltd
PY - 2021/2
Y1 - 2021/2
N2 - Background: A crucial element of human–machine interaction, the automatic detection of emotional states from human speech has long been regarded as a challenging task for machine learning models. One vital challenge in speech emotion recognition (SER) is learning robust and discriminative representations from speech. Although machine learning methods have been widely applied in SER research, the inadequate amount of available annotated data has become a bottleneck that impedes the wider application of techniques such as deep neural networks. To address this issue, we present a deep learning method that combines knowledge transfer and self-attention for SER tasks. Here, we apply the log-Mel spectrogram with deltas and delta-deltas as input. Moreover, given that emotions are time-dependent, we apply Temporal Convolutional Neural Networks (TCNs) to model the variations in emotions. We further introduce an attention transfer mechanism based on a self-attention algorithm to learn long-term dependencies. The Self-Attention Transfer Network (SATN) in our proposed approach takes advantage of attention autoencoders to learn attention from a source task, here speech recognition, and then transfers this knowledge into SER. Evaluations built on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database demonstrate the effectiveness of the proposed model.
AB - Background: A crucial element of human–machine interaction, the automatic detection of emotional states from human speech has long been regarded as a challenging task for machine learning models. One vital challenge in speech emotion recognition (SER) is learning robust and discriminative representations from speech. Although machine learning methods have been widely applied in SER research, the inadequate amount of available annotated data has become a bottleneck that impedes the wider application of techniques such as deep neural networks. To address this issue, we present a deep learning method that combines knowledge transfer and self-attention for SER tasks. Here, we apply the log-Mel spectrogram with deltas and delta-deltas as input. Moreover, given that emotions are time-dependent, we apply Temporal Convolutional Neural Networks (TCNs) to model the variations in emotions. We further introduce an attention transfer mechanism based on a self-attention algorithm to learn long-term dependencies. The Self-Attention Transfer Network (SATN) in our proposed approach takes advantage of attention autoencoders to learn attention from a source task, here speech recognition, and then transfers this knowledge into SER. Evaluations built on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database demonstrate the effectiveness of the proposed model.
KW - Attention transfer
KW - Self-attention
KW - Speech emotion recognition
KW - Temporal convolutional neural networks (TCNs)
UR - http://www.scopus.com/inward/record.url?scp=85109210009&partnerID=8YFLogxK
U2 - 10.1016/j.vrih.2020.12.002
DO - 10.1016/j.vrih.2020.12.002
M3 - Article
AN - SCOPUS:85109210009
SN - 2096-5796
VL - 3
SP - 43
EP - 54
JO - Virtual Reality and Intelligent Hardware
JF - Virtual Reality and Intelligent Hardware
IS - 1
ER -