Light-weight self-attention augmented generative adversarial networks for speech enhancement

Lujun Li, Zhenxing Lu, Tobias Watzel, Ludwig Kürzinger, Gerhard Rigoll

Research output: Contribution to journal › Article › peer-review


Abstract

Generative adversarial networks (GANs) have shown their superiority for speech enhancement. Nevertheless, most previous attempts use convolutional layers as the backbone, which may obscure long-range dependencies across an input sequence because of the convolution operator's local receptive field. One popular solution is to replace convolutional neural networks with recurrent neural networks (RNNs), but RNNs are computationally inefficient because their temporal iterations cannot be parallelized. To circumvent this limitation, we propose an end-to-end system for speech enhancement that applies the self-attention mechanism to GANs. We aim for a system that is flexible in modeling both long-range and local interactions while remaining computationally efficient. Our work proceeds in three phases: first, we apply a stand-alone self-attention layer in speech enhancement GANs; second, we employ locality modeling on the stand-alone self-attention layer; last, we investigate self-attention augmented convolutional speech enhancement GANs. Systematic experimental results indicate that, equipped with the stand-alone self-attention layer, the system outperforms baseline systems across classic evaluation criteria with up to 95% fewer parameters. Moreover, locality modeling offers a parameter-free approach to further performance improvement, and self-attention augmentation also outperforms all baseline systems with an acceptable increase in parameters.
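The third phase, a self-attention augmented convolutional layer, can be pictured as a convolutional branch (local receptive field) running in parallel with a self-attention branch (global receptive field), with the two feature maps concatenated along the channel axis. The sketch below illustrates this idea only; it is not the authors' implementation. The PyTorch framework, the class name, the channel split, the kernel size, and the use of torch.nn.MultiheadAttention are all illustrative assumptions.

```python
# Minimal sketch of a self-attention augmented 1-D convolution block
# (illustrative only; not the paper's implementation). Assumes PyTorch.
import torch
import torch.nn as nn


class AttentionAugmentedConv1d(nn.Module):
    """Concatenates local convolutional features with global
    self-attention features along the channel axis."""

    def __init__(self, in_channels, out_channels, attn_channels,
                 num_heads=4, kernel_size=31):
        super().__init__()
        # Local branch: ordinary convolution with a limited receptive field.
        self.conv = nn.Conv1d(in_channels, out_channels - attn_channels,
                              kernel_size, padding=kernel_size // 2)
        # Global branch: multi-head self-attention over the whole sequence.
        self.attn_proj = nn.Conv1d(in_channels, attn_channels, 1)
        self.attn = nn.MultiheadAttention(attn_channels, num_heads,
                                          batch_first=True)

    def forward(self, x):                      # x: (batch, channels, time)
        local = self.conv(x)                   # (batch, out - attn, time)
        h = self.attn_proj(x).transpose(1, 2)  # (batch, time, attn_channels)
        glob, _ = self.attn(h, h, h)           # self-attention: Q = K = V = h
        glob = glob.transpose(1, 2)            # back to (batch, attn, time)
        return torch.cat([local, glob], dim=1)


# Usage on a hypothetical feature map with 16 channels and 1024 time steps.
x = torch.randn(2, 16, 1024)
block = AttentionAugmentedConv1d(16, 64, attn_channels=16)
print(block(x).shape)  # torch.Size([2, 64, 1024])
```

Under these assumptions, the attention branch adds only the 1x1 projection and the attention weights on top of the convolutional backbone, which is consistent with the abstract's claim that the augmentation increases the parameter count only moderately.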

Original language: English
Article number: 1586
Journal: Electronics (Switzerland)
Volume: 10
Issue number: 13
DOIs
State: Published - 1 Jul 2021

Keywords

  • Generative adversarial networks
  • Self-attention mechanism
  • Speech enhancement
