TY - GEN
T1 - StarGAN for emotional speech conversion
T2 - 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020
AU - Rizos, Georgios
AU - Baird, Alice
AU - Elliott, Max
AU - Schuller, Björn
N1 - Publisher Copyright:
© 2020 IEEE
PY - 2020/5
Y1 - 2020/5
N2 - In this paper, we propose an adversarial network implementation for speech emotion conversion as a data augmentation method, validated by a multi-class speech affect recognition task. In our setting, we do not assume the availability of parallel data, and we additionally make it a priority to exploit the available training data as fully as possible by adopting a cycle-consistent, class-conditional generative adversarial network with an auxiliary domain classifier. Our generated samples are valuable for data augmentation, achieving corresponding 2% and 6% absolute increases in Micro- and Macro-F1 compared to the baseline in a 3-class classification paradigm using a deep, end-to-end network. We finally perform a human perception evaluation of the samples, through which we conclude that our samples are indicative of their target emotion, albeit showing a tendency for confusion in cases where the emotional attributes of valence and arousal are inconsistent.
AB - In this paper, we propose an adversarial network implementation for speech emotion conversion as a data augmentation method, validated by a multi-class speech affect recognition task. In our setting, we do not assume the availability of parallel data, and we additionally make it a priority to exploit the available training data as fully as possible by adopting a cycle-consistent, class-conditional generative adversarial network with an auxiliary domain classifier. Our generated samples are valuable for data augmentation, achieving corresponding 2% and 6% absolute increases in Micro- and Macro-F1 compared to the baseline in a 3-class classification paradigm using a deep, end-to-end network. We finally perform a human perception evaluation of the samples, through which we conclude that our samples are indicative of their target emotion, albeit showing a tendency for confusion in cases where the emotional attributes of valence and arousal are inconsistent.
KW - Adversarial networks
KW - Data augmentation
KW - Emotional speech synthesis
KW - End-to-end affective computing
UR - http://www.scopus.com/inward/record.url?scp=85091137207&partnerID=8YFLogxK
U2 - 10.1109/ICASSP40776.2020.9054579
DO - 10.1109/ICASSP40776.2020.9054579
M3 - Conference contribution
AN - SCOPUS:85091137207
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 3502
EP - 3506
BT - 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 4 May 2020 through 8 May 2020
ER -