TY - JOUR
T1 - A deep adaptation network for speech enhancement
T2 - Combining a relativistic discriminator with multi-kernel maximum mean discrepancy
AU - Cheng, Jiaming
AU - Liang, Ruiyu
AU - Liang, Zhenlin
AU - Zhao, Li
AU - Huang, Chengwei
AU - Schuller, Bjorn
N1 - Publisher Copyright: © 2014 IEEE.
PY - 2021
Y1 - 2021
N2 - In deep-learning-based speech enhancement (SE) systems, trained models often have to handle unseen noise types and language environments in real-life scenarios. However, because production environments differ from training conditions, mismatch problems arise that can seriously degrade the performance of an SE system. In this study, a domain-adaptive method combining two adaptation strategies is proposed to improve generalization to unlabeled noisy speech. In the proposed encoder-decoder-based SE framework, a domain discriminator and a domain confusion adaptation layer are introduced to conduct adversarial training. The model has two main innovations. First, the algorithm optimizes adversarial training by introducing a relativistic discriminator that relies on relative values, computed as a difference, thus avoiding possible bias and better reflecting domain differences. Second, the multi-kernel maximum mean discrepancy (MK-MMD) between domains is used as a regularization term in the domain adversarial loss, thereby further decreasing the marginal distribution distance between domains. The proposed model improves adaptability to unseen noises by encouraging the feature encoder to generate domain-invariant features. The model was evaluated in cross-noise and cross-language-and-noise experiments, and the results show that the proposed method provides considerable improvements over the baseline without adaptation in perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and frequency-weighted signal-to-noise ratio (FWSNR).
KW - Deep neural network
KW - domain adaptation
KW - maximum mean discrepancy
KW - relativistic discriminator
KW - speech enhancement
UR - http://www.scopus.com/inward/record.url?scp=85096835555&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2020.3036611
DO - 10.1109/TASLP.2020.3036611
M3 - Article
AN - SCOPUS:85096835555
SN - 2329-9290
VL - 29
SP - 41
EP - 53
JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
M1 - 9252849
ER -