TY - JOUR
T1 - Improving Audio Explanations using Audio Language Models
AU - Akman, Alican
AU - Sun, Qiyang
AU - Schuller, Björn W.
N1 - Publisher Copyright:
© 1994-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - Foundation models are widely utilised for their strong representational capabilities, driven by training on extensive datasets with self-supervised learning. The increasing complexity of these models highlights the importance of interpretability to enhance transparency and improve human understanding of their decision-making processes. Most existing interpretability methods explain model behaviour by attributing importance to individual data elements across different layers, based on their influence on the final prediction. These approaches often emphasise only the most relevant features by removing less important ones, overlooking the broader representational space. In this study, we propose a novel framework for explanation generation that serves as an alternative to feature removal, offering a more comprehensive understanding of model behaviour. Our framework leverages the generative abilities of audio language models to replace removed features with contextually appropriate alternatives, providing a more complete view of the model's decision-making process. Through extensive evaluations on standard benchmarks, including keyword spotting and speech emotion recognition, our approach demonstrates its effectiveness in generating high-quality audio explanations.
AB - Foundation models are widely utilised for their strong representational capabilities, driven by training on extensive datasets with self-supervised learning. The increasing complexity of these models highlights the importance of interpretability to enhance transparency and improve human understanding of their decision-making processes. Most existing interpretability methods explain model behaviour by attributing importance to individual data elements across different layers, based on their influence on the final prediction. These approaches often emphasise only the most relevant features by removing less important ones, overlooking the broader representational space. In this study, we propose a novel framework for explanation generation that serves as an alternative to feature removal, offering a more comprehensive understanding of model behaviour. Our framework leverages the generative abilities of audio language models to replace removed features with contextually appropriate alternatives, providing a more complete view of the model's decision-making process. Through extensive evaluations on standard benchmarks, including keyword spotting and speech emotion recognition, our approach demonstrates its effectiveness in generating high-quality audio explanations.
KW - audio explainability
KW - audio transformers
KW - computer audition
KW - explainable artificial intelligence
UR - http://www.scopus.com/inward/record.url?scp=85216636660&partnerID=8YFLogxK
U2 - 10.1109/LSP.2025.3532218
DO - 10.1109/LSP.2025.3532218
M3 - Article
AN - SCOPUS:85216636660
SN - 1070-9908
JO - IEEE Signal Processing Letters
JF - IEEE Signal Processing Letters
ER -