TY - JOUR
T1 - Channel mapping using bidirectional long short-term memory for dereverberation in hands-free voice controlled devices
AU - Zhang, Zixing
AU - Pinto, Joel
AU - Plahl, Christian
AU - Schuller, Björn
AU - Willett, Daniel
N1 - Publisher Copyright:
© 1975-2011 IEEE.
PY - 2014/8/1
Y1 - 2014/8/1
N2 - In this article, the reverberation problem for hands-free voice controlled devices is addressed by employing Bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks. Such networks use memory blocks in the hidden units, enabling them to exploit a self-learnt amount of temporal context. The main objective of this technique is to minimize the mismatch between the distant talk (reverberant/distorted) speech and the close talk (clean) speech. To achieve this, the network is trained by mapping the cepstral feature space from the distant talk channel to its counterpart from the close talk channel frame-wise in terms of regression. The method has been successfully evaluated on a realistically recorded reverberant French corpus through a large set of experiments comparing a variety of network architectures, investigating different network training targets (differential or absolute), and combining the method with common adaptation techniques. In addition, the robustness of this technique is assessed by cross-room evaluation on both a simulated French corpus and a realistic English corpus. Experimental results show that the proposed novel BLSTM dereverberation models trained with the differential targets reduce the word error rate (WER) by 16% relative on the French corpus (intra-room scenario) and by 8% relative on the English corpus (inter-room scenario).
AB - In this article, the reverberation problem for hands-free voice controlled devices is addressed by employing Bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks. Such networks use memory blocks in the hidden units, enabling them to exploit a self-learnt amount of temporal context. The main objective of this technique is to minimize the mismatch between the distant talk (reverberant/distorted) speech and the close talk (clean) speech. To achieve this, the network is trained by mapping the cepstral feature space from the distant talk channel to its counterpart from the close talk channel frame-wise in terms of regression. The method has been successfully evaluated on a realistically recorded reverberant French corpus through a large set of experiments comparing a variety of network architectures, investigating different network training targets (differential or absolute), and combining the method with common adaptation techniques. In addition, the robustness of this technique is assessed by cross-room evaluation on both a simulated French corpus and a realistic English corpus. Experimental results show that the proposed novel BLSTM dereverberation models trained with the differential targets reduce the word error rate (WER) by 16% relative on the French corpus (intra-room scenario) and by 8% relative on the English corpus (inter-room scenario).
KW - Bidirectional Long Short-Term Memory
KW - Dereverberation
KW - Hands-Free Voice Controlled Devices
KW - Indirect Feature Enhancement
UR - http://www.scopus.com/inward/record.url?scp=84908661098&partnerID=8YFLogxK
U2 - 10.1109/TCE.2014.6937339
DO - 10.1109/TCE.2014.6937339
M3 - Article
AN - SCOPUS:84908661098
SN - 0098-3063
VL - 60
SP - 525
EP - 533
JO - IEEE Transactions on Consumer Electronics
JF - IEEE Transactions on Consumer Electronics
IS - 3
M1 - 6937339
ER -