TY - GEN
T1 - Exploring hybrid CTC/attention end-to-end speech recognition with Gaussian processes
AU - Kürzinger, Ludwig
AU - Watzel, Tobias
AU - Li, Lujun
AU - Baumgartner, Robert
AU - Rigoll, Gerhard
N1 - Publisher Copyright: © Springer Nature Switzerland AG 2019.
PY - 2019
Y1 - 2019
N2 - Hybrid CTC/attention end-to-end speech recognition combines two powerful concepts. Given a speech feature sequence, the attention mechanism directly outputs a sequence of letters. Connectionist Temporal Classification (CTC) helps to bind the attention mechanism to sequential alignments. This hybrid architecture also gives more degrees of freedom in choosing parameter configurations. We applied Gaussian process optimization to estimate the impact of network parameters and of the language model weight in decoding on Character Error Rate (CER), as well as on attention accuracy. In total, we trained 70 hybrid CTC/attention networks and performed 590 beam search runs with an RNNLM as language model on the TEDlium v2 test set. To our surprise, the results challenge the assumption that CTC primarily regularizes the attention mechanism. We argue in an evidence-based manner that CTC instead regularizes the impact of language model feedback in a one-pass beam search, as letter hypotheses are fed back into the attention mechanism. Attention-only models without RNNLM already achieved 10.9 % CER, or 22.4 % Word Error Rate (WER), on the TEDlium v2 test set. Combined decoding of the same attention-only networks with RNNLM strongly underperformed, with at best 40.2 % CER, or 49.3 % WER. A combined hybrid CTC/attention model with RNNLM performed best, with 8.9 % CER, or 17.6 % WER.
KW - Attention-based neural networks
KW - Connectionist Temporal Classification
KW - End-to-end speech recognition
KW - Gaussian process optimization
KW - Hybrid CTC/attention
KW - Multi-objective training
UR - http://www.scopus.com/inward/record.url?scp=85071424036&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-26061-3_27
DO - 10.1007/978-3-030-26061-3_27
M3 - Conference contribution
AN - SCOPUS:85071424036
SN - 9783030260606
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 258
EP - 269
BT - Speech and Computer - 21st International Conference, SPECOM 2019, Proceedings
A2 - Salah, Albert Ali
A2 - Karpov, Alexey
A2 - Potapova, Rodmonga
PB - Springer Verlag
T2 - 21st International Conference on Speech and Computer, SPECOM 2019
Y2 - 20 August 2019 through 25 August 2019
ER -