TY - JOUR
T1 - Domain Invariant Feature Learning for Speaker-Independent Speech Emotion Recognition
AU - Lu, Cheng
AU - Zong, Yuan
AU - Zheng, Wenming
AU - Li, Yang
AU - Tang, Chuangao
AU - Schuller, Björn W.
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - In this paper, we propose a novel domain invariant feature learning (DIFL) method to deal with speaker-independent speech emotion recognition (SER). The basic idea of DIFL is to learn speaker-invariant emotion features by eliminating the domain shift between training and testing data caused by different speakers, from the perspective of multi-source unsupervised domain adaptation (UDA). Specifically, we embed a hierarchical alignment layer with a strong-weak distribution alignment strategy into the feature extraction block to first reduce the discrepancy in the feature distributions of speech samples across different speakers as much as possible. Furthermore, multiple discriminators in the discriminator block are utilized to confuse the speaker information of emotion features both within the training data and between the training and testing data. Through these components, a multi-domain invariant representation of emotional speech is gradually and adaptively achieved as the network parameters are updated. We conduct extensive experiments on three public datasets, i.e., Emo-DB, eNTERFACE, and CASIA, to evaluate the SER performance of the proposed method. The experimental results show that the proposed method is superior to state-of-the-art methods.
AB - In this paper, we propose a novel domain invariant feature learning (DIFL) method to deal with speaker-independent speech emotion recognition (SER). The basic idea of DIFL is to learn speaker-invariant emotion features by eliminating the domain shift between training and testing data caused by different speakers, from the perspective of multi-source unsupervised domain adaptation (UDA). Specifically, we embed a hierarchical alignment layer with a strong-weak distribution alignment strategy into the feature extraction block to first reduce the discrepancy in the feature distributions of speech samples across different speakers as much as possible. Furthermore, multiple discriminators in the discriminator block are utilized to confuse the speaker information of emotion features both within the training data and between the training and testing data. Through these components, a multi-domain invariant representation of emotional speech is gradually and adaptively achieved as the network parameters are updated. We conduct extensive experiments on three public datasets, i.e., Emo-DB, eNTERFACE, and CASIA, to evaluate the SER performance of the proposed method. The experimental results show that the proposed method is superior to state-of-the-art methods.
KW - Speech emotion recognition
KW - adversarial learning
KW - multi-source domain adaptation
KW - speaker independent
KW - unsupervised domain adaptation
UR - http://www.scopus.com/inward/record.url?scp=85135357207&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2022.3178232
DO - 10.1109/TASLP.2022.3178232
M3 - Article
AN - SCOPUS:85135357207
SN - 2329-9290
VL - 30
SP - 2217
EP - 2230
JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
ER -