Improving Audio Explanations using Audio Language Models

Alican Akman, Qiyang Sun, Björn W. Schuller

Research output: Contribution to journal › Article › peer-review

Abstract

Foundation models are widely utilised for their strong representational capabilities, driven by training on extensive datasets with self-supervised learning. The increasing complexity of these models highlights the importance of interpretability to enhance transparency and improve human understanding of their decision-making processes. Most existing interpretability methods explain model behaviour by attributing importance to individual data elements across different layers, based on their influence on the final prediction. These approaches often emphasise only the most relevant features by removing less important ones, thereby overlooking the broader representational space. In this study, we propose a novel framework for explanation generation that serves as an alternative to feature removal, offering a more comprehensive understanding of model behaviour. Our framework leverages the generative abilities of audio language models to replace removed features with contextually appropriate alternatives, providing a more complete view of the model's decision-making process. Through extensive evaluations on standard benchmarks, including keyword spotting and speech emotion recognition, our approach demonstrates its effectiveness in generating high-quality audio explanations.
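The core idea described in the abstract, scoring audio regions by replacing them with generated, contextually appropriate content rather than simply removing or silencing them, can be illustrated with a minimal occlusion-style sketch. The classify and infill callables below are hypothetical stand-ins for a downstream classifier and an audio language model; they are not the paper's actual interfaces, and the windowing scheme is an assumption made purely for illustration.

    import numpy as np

    def occlusion_attribution(audio, classify, infill, target, win=1600, hop=1600):
        """Score each window by the drop in the target-class score when the
        window is replaced with generated (in-context) audio instead of
        being removed or zeroed out."""
        base = classify(audio)[target]
        scores = []
        for start in range(0, len(audio) - win + 1, hop):
            end = start + win
            # Replace the window with a contextually plausible infill
            # produced by a generative audio model (hypothetical here).
            replaced = infill(audio, start, end)
            scores.append(base - classify(replaced)[target])
        return np.array(scores)

    if __name__ == "__main__":
        # Toy usage with stand-in functions (fixed classifier output,
        # zero-fill "generation"), purely to show the call pattern.
        rng = np.random.default_rng(0)
        audio = rng.standard_normal(16000).astype(np.float32)
        classify = lambda a: np.array([0.3, 0.7])  # dummy 2-class scores
        def infill(a, s, e):
            out = a.copy()
            out[s:e] = 0.0  # placeholder for model-generated audio
            return out
        print(occlusion_attribution(audio, classify, infill, target=1))

In the paper's setting, the infill step would be performed by the audio language model so that importance is measured against plausible alternative content rather than against silence.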

Original language: English
Journal: IEEE Signal Processing Letters
DOIs
State: Accepted/In press - 2025

Keywords

  • audio explainability
  • audio transformers
  • computer audition
  • explainable artificial intelligence
