TY - JOUR
T1 - ParaCLAP - Towards a general language-audio model for computational paralinguistic tasks
AU - Jing, Xin
AU - Triantafyllopoulos, Andreas
AU - Schuller, Björn
N1 - Publisher Copyright:
© 2024 International Speech Communication Association. All rights reserved.
PY - 2024
Y1 - 2024
AB - Contrastive language-audio pretraining (CLAP) has recently emerged as a method for making audio analysis more generalisable. Specifically, CLAP-style models are able to 'answer' a diverse set of language queries, extending the capabilities of audio models beyond a closed set of labels. However, CLAP relies on a large set of (audio, query) pairs for pretraining. While such sets are available for general audio tasks, like captioning or sound event detection, there are no datasets with matched audio and text queries for computational paralinguistic (CP) tasks. As a result, the community relies on generic CLAP models trained for general audio with limited success. In the present study, we explore training considerations for ParaCLAP, a CLAP-style model suited to CP, including a novel process for creating audio-language queries. We demonstrate its effectiveness on a set of computational paralinguistic tasks, where it is shown to surpass the performance of open-source state-of-the-art models. Our code and resources are publicly available at: https://github.com/KeiKinn/ParaCLAP.
KW - computational paralinguistics
KW - contrastive learning
KW - speech emotion recognition
KW - zero-shot learning
UR - http://www.scopus.com/inward/record.url?scp=85206489054&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2024-1315
DO - 10.21437/Interspeech.2024-1315
M3 - Conference article
AN - SCOPUS:85206489054
SN - 2308-457X
SP - 1155
EP - 1159
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 25th Interspeech Conference 2024
Y2 - 1 September 2024 through 5 September 2024
ER -