TY - GEN
T1 - ATTHEAR
T2 - 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024
AU - Akman, Alican
AU - Schuller, Björn W.
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - The increasing success of transformer models in various fields, such as computer vision and audio processing, has led to a growing need for improved explainability to better understand their complex decision-making processes. Most existing techniques for explaining transformer models concentrate primarily on delivering visual and textual explanations, which are commonly used for visual media. However, audio explanations are crucial due to their intuitiveness in audio-based tasks and their distinctive expressiveness compared with other modalities. This work proposes a novel method to interpret audio-processing transformer models. Our method combines the attention mechanism available inside these models with non-negative matrix factorisation (NMF) to compute relevancy for audio inputs. While NMF decomposes audio into spectral patterns, attention weights are utilised to calculate time activations for these spectral patterns. The method then generates listenable audio explanations for the model's final decision using the most relevant audio portions. We demonstrate that our method effectively generates explanations by benchmarking against standard datasets, including keyword spotting and environmental sound classification.
AB - The increasing success of transformer models in various fields, such as computer vision and audio processing, has led to a growing need for improved explainability to better understand their complex decision-making processes. Most existing techniques for explaining transformer models concentrate primarily on delivering visual and textual explanations, which are commonly used for visual media. However, audio explanations are crucial due to their intuitiveness in audio-based tasks and their distinctive expressiveness compared with other modalities. This work proposes a novel method to interpret audio-processing transformer models. Our method combines the attention mechanism available inside these models with non-negative matrix factorisation (NMF) to compute relevancy for audio inputs. While NMF decomposes audio into spectral patterns, attention weights are utilised to calculate time activations for these spectral patterns. The method then generates listenable audio explanations for the model's final decision using the most relevant audio portions. We demonstrate that our method effectively generates explanations by benchmarking against standard datasets, including keyword spotting and environmental sound classification.
KW - Audio Explainability
KW - Audio Transformers
KW - Computer Audition
KW - Explainable Artificial Intelligence
UR - http://www.scopus.com/inward/record.url?scp=85195416450&partnerID=8YFLogxK
U2 - 10.1109/ICASSP48485.2024.10447390
DO - 10.1109/ICASSP48485.2024.10447390
M3 - Conference contribution
AN - SCOPUS:85195416450
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 7015
EP - 7019
BT - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 14 April 2024 through 19 April 2024
ER -