Lightweight end-to-end speech recognition from raw audio data using sinc-convolutions

Ludwig Kürzinger, Nicolas Lindae, Palle Klewitz, Gerhard Rigoll

Publication: Contribution to book/report/conference proceedings; conference contribution; peer-reviewed

4 citations (Scopus)

Abstract

Many end-to-end Automatic Speech Recognition (ASR) systems still rely on pre-processed frequency-domain features that are handcrafted to emulate human hearing. Our work is motivated by recent advances in integrated learnable feature extraction. To this end, we propose Lightweight Sinc-Convolutions (LSC), which integrate Sinc-convolutions with depthwise convolutions as a low-parameter, machine-learnable feature extractor for end-to-end ASR systems. We integrated LSC into the hybrid CTC/attention architecture for evaluation. The resulting end-to-end model shows smooth convergence behaviour that is further improved by applying SpecAugment in the time domain. We also discuss filter-level improvements, such as using log-compression as the activation function. Our model achieves a word error rate of 10.7% on the TEDlium v2 test dataset, surpassing the corresponding architecture with log-mel filterbank features by an absolute 1.9% while having only 21% of its model size.
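
To illustrate why a sinc-convolution is low in parameters, here is a minimal sketch of a single band-pass sinc filter followed by log-compression. It is not the authors' implementation. It assumes the common formulation in which a band-pass kernel is the difference of two ideal low-pass (sinc) filters, so each filter is described by two cutoff frequencies regardless of kernel length. The Hamming window, the form log(|x| + 1) for log-compression, and the names `sinc_bandpass`, `log_compress` and `apply_filter` are assumptions made for this example. The depthwise convolutions and the CTC/attention model mentioned in the abstract are not covered.

```python
import math


def sinc_bandpass(f_low, f_high, kernel_size, sample_rate):
    """Windowed band-pass FIR kernel set by two cutoff frequencies in Hz.

    Hypothetical helper: in a learnable front end, f_low and f_high would be
    the only trainable values per filter; here they are plain numbers.
    """
    if kernel_size < 3 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be an odd number >= 3")
    half = kernel_size // 2
    # Cutoffs normalised to cycles per sample.
    f1 = f_low / sample_rate
    f2 = f_high / sample_rate
    kernel = []
    for i in range(kernel_size):
        n = i - half
        if n == 0:
            # Limit of the sinc difference at the centre tap.
            value = 2.0 * (f2 - f1)
        else:
            # Difference of two ideal low-pass filters yields a band-pass.
            value = (math.sin(2 * math.pi * f2 * n)
                     - math.sin(2 * math.pi * f1 * n)) / (math.pi * n)
        # Hamming window (assumed) to reduce ripple from truncation.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (kernel_size - 1))
        kernel.append(value * window)
    return kernel


def log_compress(x):
    """Log-compression used as an activation; the form log(|x| + 1) is assumed."""
    return math.log(abs(x) + 1.0)


def apply_filter(signal, kernel, stride=1):
    """Plain 1-D 'valid' convolution of raw samples with one kernel.

    The kernel is symmetric, so convolution and correlation coincide.
    """
    last_start = len(signal) - len(kernel)
    return [
        sum(k * signal[start + j] for j, k in enumerate(kernel))
        for start in range(0, last_start + 1, stride)
    ]


def sine(freq_hz, sample_rate, num_samples):
    """Test tone generator used only for the usage example."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(num_samples)]
```

With these assumed helpers, `sinc_bandpass(300, 3400, 101, 16000)` yields a 101-tap kernel determined by just two numbers. Passing a 1 kHz tone through `apply_filter` leaves it nearly unchanged, whereas a 6 kHz tone is strongly attenuated. Applying `log_compress` to each output sample then gives a non-negative, compressed feature value.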

Original language: English
Title: Interspeech 2020
Publisher: International Speech Communication Association
Pages: 1659-1663
Number of pages: 5
ISBN (Print): 9781713820697
Publication status: Published - 2020
Event: 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020 - Shanghai, China
Duration: 25 Oct 2020 to 29 Oct 2020

Publication series

Name: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2020-October
ISSN (Print): 2308-457X
ISSN (electronic): 1990-9772

Conference

Conference: 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Country/Territory: China
City: Shanghai
Period: 25/10/20 to 29/10/20
