Noise robust ASR in reverberated multisource environments applying convolutive NMF and Long Short-Term Memory

Martin Wöllmer, Felix Weninger, Jürgen Geiger, Björn Schuller, Gerhard Rigoll

Research output: Contribution to journal › Article › peer-review

12 Scopus citations

Abstract

This article proposes and evaluates several methods for integrating bidirectional Long Short-Term Memory (BLSTM) temporal context modeling into a system for automatic speech recognition (ASR) in noisy and reverberated environments. Building on recent advances in Long Short-Term Memory architectures for ASR, we design a novel front-end for context-sensitive Tandem feature extraction and show how the Connectionist Temporal Classification approach can be used as a BLSTM-based back-end, as an alternative to Hidden Markov Models (HMMs). We combine context-sensitive BLSTM-based feature generation and speech decoding techniques with source separation by convolutive non-negative matrix factorization (NMF). Our speaker-adapted multi-stream HMM framework processes MFCC features from NMF-enhanced speech together with word predictions obtained via BLSTM networks and non-negative sparse classification (NSC). With this framework we obtain an average accuracy of 91.86% on the PASCAL CHiME Challenge task at signal-to-noise ratios ranging from -6 to 9 dB. To our knowledge, this is the best result ever reported for the CHiME Challenge task.
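As a rough illustration of the factorization idea behind the source-separation step, the sketch below runs plain (non-convolutive) NMF with the standard multiplicative updates for the Euclidean cost on a toy non-negative matrix. It is a simplified stand-in, not the article's method: the abstract does not specify the convolutive formulation, cost function, or settings used, and the names `nmf`, `matmul`, `transpose`, and `frobenius_error` are hypothetical helpers introduced only for this example.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2
        for i in range(len(v))
        for j in range(len(v[0]))
    )


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Approximate a non-negative matrix V (rows x cols) as W H.

    W (rows x rank) can be read as spectral bases and H (rank x cols) as
    their activations over time. Assumption: plain NMF with the
    multiplicative updates for the Euclidean cost; the convolutive variant
    named in the abstract is NOT implemented here.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    # Strictly positive random initialization.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(rank)]

    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [
            [h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_cols)]
            for k in range(rank)
        ]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [
            [w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
            for i in range(n_rows)
        ]
    return w, h


if __name__ == "__main__":
    # Toy "spectrogram": 3 frequency bins x 3 frames, built from two patterns.
    v = [[1.0, 0.0, 2.0], [2.0, 0.0, 4.0], [0.0, 3.0, 3.0]]
    w, h = nmf(v, rank=2)
    print("reconstruction error:", frobenius_error(v, w, h))
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout, which is what allows each component to be read as an additive part of the mixture.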

Original language: English
Article number: 532
Pages (from-to): 780-797
Number of pages: 18
Journal: Computer Speech and Language
Volume: 27
Issue number: 3
State: Published - 1 May 2013

Keywords

  • Automatic speech recognition
  • Long Short-Term Memory
  • Non-negative matrix factorization
  • Tandem feature extraction
