TY - GEN
T1 - A Global Discriminant Joint Training Framework for Robust Speech Recognition
AU - Li, Lujun
AU - Kurzinger, Ludwig
AU - Watzel, Tobias
AU - Rigoll, Gerhard
N1 - Publisher Copyright: © 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - Robustness in adverse acoustic conditions is critical for practical human-machine interaction. A common solution to this problem is to add an independent speech enhancement front-end. However, because it is trained separately from the automatic speech recognition (ASR) module, the independent enhancement front-end easily converges to a sub-optimal solution. In addition, the handcrafted loss function of the enhancement module tends to introduce unseen distortions, which can even degrade ASR performance. To address this concern, joint training is attracting growing interest. Nevertheless, none of the previously proposed joint-training frameworks is built on the increasingly popular self-attention mechanism or a generative adversarial architecture. This paper proposes a novel joint-training framework that concatenates a speech enhancement generative adversarial network as the front-end and a self-attention-based ASR module as the back-end, trained jointly as a single extended network, to boost the noise robustness of the end-to-end ASR system. A Sinc convolution layer is merged into the speech enhancement front-end to extract more representative features. Moreover, a discriminant component simultaneously serves as the local guide of the enhancement module and the global guide in the joint training, steering the enhancement front-end to output more desirable features for the subsequent ASR module and thereby offsetting the limitations of separate training and handcrafted loss functions. Systematic experiments reveal that the proposed framework significantly outperforms other competitive solutions, especially in challenging environments.
AB - Robustness in adverse acoustic conditions is critical for practical human-machine interaction. A common solution to this problem is to add an independent speech enhancement front-end. However, because it is trained separately from the automatic speech recognition (ASR) module, the independent enhancement front-end easily converges to a sub-optimal solution. In addition, the handcrafted loss function of the enhancement module tends to introduce unseen distortions, which can even degrade ASR performance. To address this concern, joint training is attracting growing interest. Nevertheless, none of the previously proposed joint-training frameworks is built on the increasingly popular self-attention mechanism or a generative adversarial architecture. This paper proposes a novel joint-training framework that concatenates a speech enhancement generative adversarial network as the front-end and a self-attention-based ASR module as the back-end, trained jointly as a single extended network, to boost the noise robustness of the end-to-end ASR system. A Sinc convolution layer is merged into the speech enhancement front-end to extract more representative features. Moreover, a discriminant component simultaneously serves as the local guide of the enhancement module and the global guide in the joint training, steering the enhancement front-end to output more desirable features for the subsequent ASR module and thereby offsetting the limitations of separate training and handcrafted loss functions. Systematic experiments reveal that the proposed framework significantly outperforms other competitive solutions, especially in challenging environments.
KW - Sinc convolution
KW - generative adversarial networks
KW - joint training framework
KW - robust speech recognition
KW - self-attention mechanism
KW - speech enhancement
UR - http://www.scopus.com/inward/record.url?scp=85123920301&partnerID=8YFLogxK
U2 - 10.1109/ICTAI52525.2021.00088
DO - 10.1109/ICTAI52525.2021.00088
M3 - Conference contribution
AN - SCOPUS:85123920301
T3 - Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI
SP - 544
EP - 551
BT - Proceedings - 2021 IEEE 33rd International Conference on Tools with Artificial Intelligence, ICTAI 2021
PB - IEEE Computer Society
T2 - 33rd IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2021
Y2 - 1 November 2021 through 3 November 2021
ER -