TY - JOUR
T1 - Multichannel Speech Enhancement Based on Neural Beamforming and a Context-Focused Post-Filtering Network
AU - Pang, Cong
AU - Fan, Jingjie
AU - Shen, Qifan
AU - Xie, Yue
AU - Huang, Chengwei
AU - Schuller, Björn W.
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2024/6/1
Y1 - 2024/6/1
AB - Both spatial and temporal contextual information are essential for the multichannel speech enhancement (MCSE) task. In this work, we propose a unified MCSE network composed of neural beamforming and a context-focused post-filtering network to fully exploit both types of information. The network estimates optimal complex ideal ratio masks (cIRMs), which exploit phase information in the frequency domain to reconstruct the speech waveform more effectively. To assign adaptive weights to the channels, we first adopt a dilated convolution-based network that simulates beamforming on the original multichannel input spectrum and serves as the front end of the multichannel acoustic model. Furthermore, we propose a post-filtering network that feeds the output of the proposed U-Net into a convolutional long short-term memory (ConvLSTM) layer, which captures both the contextual information and the spatial correlation of the features. We conduct experiments on the VOiCES, CHiME-3, and WMIR data sets. The experiments show that, in various scenarios, the proposed algorithm improves over previous state-of-the-art algorithms in terms of PESQ, STOI, and SI-SNR.
KW - Convolutional long short-term memory (ConvLSTM)
KW - U-Net
KW - dilated convolution
KW - multichannel speech enhancement (MCSE)
KW - neural beamforming
UR - http://www.scopus.com/inward/record.url?scp=85171751516&partnerID=8YFLogxK
U2 - 10.1109/TCDS.2023.3316301
DO - 10.1109/TCDS.2023.3316301
M3 - Article
AN - SCOPUS:85171751516
SN - 2379-8920
VL - 16
SP - 973
EP - 983
JO - IEEE Transactions on Cognitive and Developmental Systems
JF - IEEE Transactions on Cognitive and Developmental Systems
IS - 3
ER -