CTC-Segmentation of Large Corpora for German End-to-End Speech Recognition

Ludwig Kürzinger, Dominik Winkelbauer, Lujun Li, Tobias Watzel, Gerhard Rigoll

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

70 Scopus citations

Abstract

Recent end-to-end Automatic Speech Recognition (ASR) systems have demonstrated the ability to outperform conventional hybrid DNN/HMM ASR. Aside from architectural improvements, these models have grown in depth, parameter count, and model capacity. However, these models also require more training data to achieve comparable performance. In this work, we combine freely available corpora for German speech recognition, including yet unlabeled speech data, into a large dataset of over 1700 h of speech data. For data preparation, we propose a two-stage approach that uses an ASR model pre-trained with Connectionist Temporal Classification (CTC) to bootstrap more training data from unsegmented or unlabeled training data. Utterances are then extracted using the label probabilities obtained from the CTC-trained network to determine segment alignments. With this training data, we trained a hybrid CTC/attention Transformer model that achieves 12.8% WER on the Tuda-DE test set, surpassing the previous baseline of 14.4% set by conventional hybrid DNN/HMM ASR.
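To illustrate the idea of deriving segment alignments from CTC label probabilities, here is a minimal sketch of Viterbi-style forced alignment over per-frame CTC log-probabilities. It is not the authors' algorithm: it assumes the transcript covers the whole audio clip, and the function name `ctc_viterbi_align`, the toy vocabulary, and the probability values are all hypothetical choices made for this example.

```python
import math

def ctc_viterbi_align(log_probs, tokens, blank=0):
    """Return (first_frame, last_frame) for each token on the best CTC path.

    log_probs: list of T rows, each a list of per-label log-probabilities.
    tokens:    target label indices (without blanks).
    """
    T = len(log_probs)
    # Interleave blanks: [blank, y1, blank, y2, ..., yL, blank]
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]
    S = len(ext)
    NEG = -math.inf
    score = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]

    # A path may start in the leading blank or in the first token.
    score[0][0] = log_probs[0][blank]
    if S > 1:
        score[0][1] = log_probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            cands = [(score[t - 1][s], s)]            # stay in the same state
            if s >= 1:
                cands.append((score[t - 1][s - 1], s - 1))  # advance by one
            # Skipping a blank is allowed only between two different tokens.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((score[t - 1][s - 2], s - 2))
            best, prev = max(cands)
            if best > NEG:
                score[t][s] = best + log_probs[t][ext[s]]
                back[t][s] = prev

    # A path may end in the trailing blank or in the last token.
    s = S - 1
    if S > 1 and score[T - 1][S - 2] > score[T - 1][S - 1]:
        s = S - 2
    if score[T - 1][s] == NEG:
        raise ValueError("not enough frames to align the token sequence")

    # Backtrack to recover the state occupied at every frame.
    path = [0] * T
    for t in range(T - 1, -1, -1):
        path[t] = s
        s = back[t][s]

    spans = []
    for i in range(len(tokens)):
        frames = [t for t, st in enumerate(path) if st == 2 * i + 1]
        spans.append((frames[0], frames[-1]))
    return spans

# Toy example: labels 0 = blank, 1 = "a", 2 = "b"; six frames of posteriors.
probs = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
    [0.8, 0.1, 0.1],
]
log_probs = [[math.log(p) for p in row] for row in probs]
spans = ctc_viterbi_align(log_probs, [1, 2])
print(spans)  # [(1, 2), (4, 4)]
```

Multiplying the frame indices by the model's frame shift would turn these spans into timestamps at which utterances could be cut from a longer recording.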

Original language: English
Title of host publication: Speech and Computer - 22nd International Conference, SPECOM 2020, Proceedings
Editors: Alexey Karpov, Rodmonga Potapova
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 267-278
Number of pages: 12
ISBN (Print): 9783030602758
DOIs
State: Published - 2020
Event: 22nd International Conference on Speech and Computer, SPECOM 2020 - St. Petersburg, Russian Federation
Duration: 7 Oct 2020 to 9 Oct 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12335 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 22nd International Conference on Speech and Computer, SPECOM 2020
Country/Territory: Russian Federation
City: St. Petersburg
Period: 7/10/20 to 9/10/20

Keywords

  • CTC-segmentation
  • End-to-end automatic speech recognition
  • German speech dataset
  • Hybrid CTC/Attention
