Exploring hybrid CTC/attention end-to-end speech recognition with Gaussian processes

Ludwig Kürzinger, Tobias Watzel, Lujun Li, Robert Baumgartner, Gerhard Rigoll

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

4 Scopus citations

Abstract

Hybrid CTC/attention end-to-end speech recognition combines two powerful concepts. Given a speech feature sequence, the attention mechanism directly outputs a sequence of letters. Connectionist Temporal Classification (CTC) helps to bind the attention mechanism to sequential alignments. This hybrid architecture also gives more degrees of freedom in choosing parameter configurations. We applied Gaussian process optimization to estimate the impact of network parameters and of the language model weight in decoding on Character Error Rate (CER), as well as on attention accuracy. In total, we trained 70 hybrid CTC/attention networks and performed 590 beam search runs with an RNNLM as language model on the TEDlium v2 test set. To our surprise, the results challenge the assumption that CTC primarily regularizes the attention mechanism. We argue, based on this evidence, that CTC instead regularizes the impact of language model feedback in a one-pass beam search, as letter hypotheses are fed back into the attention mechanism. Attention-only models without RNNLM already achieved 10.9 % CER, or 22.4 % Word Error Rate (WER), on the TEDlium v2 test set. Combined decoding of the same attention-only networks with an RNNLM strongly underperformed, reaching at best 40.2 % CER, or 49.3 % WER. A combined hybrid CTC/attention model with RNNLM performed best, with 8.9 % CER, or 17.6 % WER.
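The abstract does not spell out how the CTC, attention, and language model contributions are combined. The sketch below assumes the commonly used hybrid CTC/attention formulation: a linear interpolation of the two training losses, and a weighted sum of log-probabilities when scoring beam search hypotheses. The function names, the `ctc_weight` and `lm_weight` parameters, and the toy probabilities are all illustrative assumptions, not values or code from the paper.

```python
import math


def hybrid_loss(ctc_loss, att_loss, ctc_weight):
    """Multi-objective training loss (assumed form): interpolate CTC and attention losses."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


def joint_score(logp_ctc, logp_att, logp_lm, ctc_weight, lm_weight):
    """Hypothesis score in one-pass decoding (assumed form).

    ctc_weight = 0 gives attention-only decoding; lm_weight = 0 disables the RNNLM.
    """
    return (ctc_weight * logp_ctc
            + (1.0 - ctc_weight) * logp_att
            + lm_weight * logp_lm)


# Invented toy log-probabilities (CTC, attention, LM) for two candidate transcripts.
HYPOTHESES = {
    "hello world": (math.log(0.30), math.log(0.20), math.log(0.10)),
    "hollow word": (math.log(0.05), math.log(0.40), math.log(0.01)),
}


def best_hypothesis(ctc_weight, lm_weight):
    """Return the candidate with the highest joint score."""
    return max(
        HYPOTHESES,
        key=lambda h: joint_score(*HYPOTHESES[h], ctc_weight, lm_weight),
    )
```

In this toy setup, `ctc_weight` and `lm_weight` play the role of the configuration parameters whose effect on CER the abstract says was estimated with Gaussian process optimization; the optimizer itself is not shown here.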

Original language: English
Title of host publication: Speech and Computer - 21st International Conference, SPECOM 2019, Proceedings
Editors: Albert Ali Salah, Alexey Karpov, Rodmonga Potapova
Publisher: Springer Verlag
Pages: 258-269
Number of pages: 12
ISBN (Print): 9783030260606
State: Published - 2019
Event: 21st International Conference on Speech and Computer, SPECOM 2019 - Istanbul, Turkey
Duration: 20 Aug 2019 to 25 Aug 2019

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11658 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 21st International Conference on Speech and Computer, SPECOM 2019
Country/Territory: Turkey
City: Istanbul
Period: 20/08/19 to 25/08/19

Keywords

  • Attention-based neural networks
  • Connectionist Temporal Classification
  • End-to-end speech recognition
  • Gaussian process optimization
  • Hybrid CTC/attention
  • Multi-objective training
