ParaCLAP - Towards a general language-audio model for computational paralinguistic tasks

Xin Jing, Andreas Triantafyllopoulos, Björn Schuller

Research output: Contribution to journal › Conference article › peer-review

Abstract

Contrastive language-audio pretraining (CLAP) has recently emerged as a method for making audio analysis more generalisable. Specifically, CLAP-style models are able to 'answer' a diverse set of language queries, extending the capabilities of audio models beyond a closed set of labels. However, CLAP relies on a large set of (audio, query) pairs for pretraining. While such sets are available for general audio tasks, like captioning or sound event detection, there are no datasets with matched audio and text queries for computational paralinguistic (CP) tasks. As a result, the community relies on generic CLAP models trained for general audio with limited success. In the present study, we explore training considerations for ParaCLAP, a CLAP-style model suited to CP, including a novel process for creating audio-language queries. We demonstrate its effectiveness on a set of computational paralinguistic tasks, where it is shown to surpass the performance of open-source state-of-the-art models. Our code and resources are publicly available at: https://github.com/KeiKinn/ParaCLAP.
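To illustrate the CLAP-style setup the abstract refers to, the sketch below shows a symmetric contrastive objective over matched (audio, query) embeddings and a zero-shot scoring step for answering language queries. This is a minimal, generic example for illustration only: the encoder choices, projection sizes, temperature, and function names are assumptions, not the configuration used in ParaCLAP (see the linked repository for the actual implementation).

```python
# Minimal CLAP-style sketch (illustrative; hyperparameters and names are assumptions).
import torch
import torch.nn.functional as F


def clap_contrastive_loss(audio_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched (audio, query) pairs."""
    # L2-normalise both modalities so the dot product is a cosine similarity.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity matrix: row i should match column i.
    logits = audio_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions (audio -> text and text -> audio).
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2t + loss_t2a) / 2


def zero_shot_scores(audio_emb: torch.Tensor,
                     query_embs: torch.Tensor) -> torch.Tensor:
    """Score one or more audio clips against candidate text queries (higher = better match)."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    query_embs = F.normalize(query_embs, dim=-1)
    return audio_emb @ query_embs.t()
```

In such a setup, zero-shot prediction for a CP task amounts to embedding the candidate queries (e.g. emotion labels rendered as text) and picking the query with the highest similarity to the audio embedding; the paper's contribution concerns how suitable (audio, query) pairs are constructed for CP pretraining, which is not shown here.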

Original language: English
Pages (from-to): 1155-1159
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
DOIs
State: Published - 2024
Event: 25th Interspeech Conference 2024 - Kos Island, Greece
Duration: 1 Sep 2024 - 5 Sep 2024

Keywords

  • computational paralinguistic
  • contrastive learning
  • speech emotion recognition
  • zero-shot learning
