TY - JOUR
T1 - Residual Fusion Probabilistic Knowledge Distillation for Speech Enhancement
AU - Cheng, Jiaming
AU - Liang, Ruiyu
AU - Zhou, Lin
AU - Zhao, Li
AU - Huang, Chengwei
AU - Schuller, Björn W.
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - In recent years, a great deal of research has focused on developing neural network (NN)-based speech enhancement (SE) models, which have achieved promising results. However, NN-based models typically require expensive computations to achieve remarkable performance, constraining their deployment in real-world scenarios, especially when hardware resources are limited or latency requirements are strict. To reduce this computational burden, we propose a unified residual fusion probabilistic knowledge distillation (KD) method for the SE task, in which knowledge is transferred from a deep teacher to a shallower student model. Previous KD approaches have commonly focused on narrowing the output distances between teachers and students, whereas the intermediate representations of these models remain underexplored. In this paper, we first study a cross-layer residual feature fusion strategy, which enables the student model to distill knowledge contained in multiple teacher layers, from shallow to deep. Second, a frame weighting probabilistic distillation loss is proposed to place more emphasis on frames containing essential information and to preserve pairwise probabilistic similarities in the representation space. The proposed distillation framework is applied to the dual-path dilated convolutional recurrent network (DPDCRN), which took first place in the SE track of the L3DAS23 challenge. Extensive experiments are conducted on single-channel and multichannel SE datasets. Objective evaluations show that the proposed KD strategy outperforms other distillation methods and considerably improves the enhancement performance of the low-complexity student model (with only 17% of the teacher's parameters).
AB - In recent years, a great deal of research has focused on developing neural network (NN)-based speech enhancement (SE) models, which have achieved promising results. However, NN-based models typically require expensive computations to achieve remarkable performance, constraining their deployment in real-world scenarios, especially when hardware resources are limited or latency requirements are strict. To reduce this computational burden, we propose a unified residual fusion probabilistic knowledge distillation (KD) method for the SE task, in which knowledge is transferred from a deep teacher to a shallower student model. Previous KD approaches have commonly focused on narrowing the output distances between teachers and students, whereas the intermediate representations of these models remain underexplored. In this paper, we first study a cross-layer residual feature fusion strategy, which enables the student model to distill knowledge contained in multiple teacher layers, from shallow to deep. Second, a frame weighting probabilistic distillation loss is proposed to place more emphasis on frames containing essential information and to preserve pairwise probabilistic similarities in the representation space. The proposed distillation framework is applied to the dual-path dilated convolutional recurrent network (DPDCRN), which took first place in the SE track of the L3DAS23 challenge. Extensive experiments are conducted on single-channel and multichannel SE datasets. Objective evaluations show that the proposed KD strategy outperforms other distillation methods and considerably improves the enhancement performance of the low-complexity student model (with only 17% of the teacher's parameters).
KW - Speech enhancement
KW - frame weighting
KW - knowledge distillation (KD)
KW - low-complexity
KW - residual fusion
UR - http://www.scopus.com/inward/record.url?scp=85192207196&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2024.3395978
DO - 10.1109/TASLP.2024.3395978
M3 - Article
AN - SCOPUS:85192207196
SN - 2329-9290
VL - 32
SP - 2680
EP - 2691
JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
ER -