TY - GEN
T1 - MP3 Compression to Diminish Adversarial Noise in End-to-End Speech Recognition
AU - Andronic, Iustina
AU - Kürzinger, Ludwig
AU - Chavez Rosas, Edgar Ricardo
AU - Rigoll, Gerhard
AU - Seeber, Bernhard U.
N1 - Publisher Copyright: © 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
AB - Audio Adversarial Examples (AAEs) are purposefully designed inputs meant to trick Automatic Speech Recognition (ASR) systems into misclassification. The present work proposes MP3 compression as a means to decrease the impact of Adversarial Noise (AN) in audio samples transcribed by ASR systems. To this end, we generated AAEs with a new variant of the Fast Gradient Sign Method for an end-to-end, hybrid CTC-attention ASR system. The effectiveness of MP3 compression against AN is then validated by two objective indicators: (1) Character Error Rates (CER) that measure the speech decoding performance of four ASR models trained on different audio formats (both uncompressed and MP3-compressed) and (2) Signal-to-Noise Ratio (SNR) estimated for uncompressed and MP3-compressed AAEs that are reconstructed in the time domain by feature inversion. We found that MP3 compression applied to AAEs indeed reduces the CER when compared to uncompressed AAEs. Moreover, feature-inverted (reconstructed) AAEs had significantly higher SNRs after MP3 compression, indicating that AN was reduced. In contrast to AN, MP3 compression applied to utterances augmented with regular noise resulted in more transcription errors, providing further evidence that MP3 encoding is effective specifically against AN.
KW - Audio Adversarial Examples
KW - Automatic Speech Recognition (ASR)
KW - MP3 compression
UR - http://www.scopus.com/inward/record.url?scp=85092922685&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-60276-5_3
DO - 10.1007/978-3-030-60276-5_3
M3 - Conference contribution
AN - SCOPUS:85092922685
SN - 9783030602758
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 22
EP - 34
BT - Speech and Computer - 22nd International Conference, SPECOM 2020, Proceedings
A2 - Karpov, Alexey
A2 - Potapova, Rodmonga
PB - Springer Science and Business Media Deutschland GmbH
T2 - 22nd International Conference on Speech and Computer, SPECOM 2020
Y2 - 7 October 2020 through 9 October 2020
ER -