TY - JOUR
T1 - This Paper Had the Smartest Reviewers - Flattery Detection Utilising an Audio-Textual Transformer-Based Approach
AU - Christ, Lukas
AU - Amiriparian, Shahin
AU - Hawighorst, Friederike
AU - Schill, Ann Kathrin
AU - Boutalikakis, Angelo
AU - Graf-Vlachy, Lorenz
AU - König, Andreas
AU - Schuller, Björn W.
N1 - Publisher Copyright:
© 2024 International Speech Communication Association. All rights reserved.
PY - 2024
Y1 - 2024
AB - Flattery is an important aspect of human communication that facilitates social bonding, shapes perceptions, and influences behaviour through strategic compliments and praise, leveraging the power of speech to build rapport effectively. Its automatic detection can thus enhance the naturalness of human-AI interactions. To meet this need, we present a novel audio-textual dataset comprising 20 hours of speech and train machine learning models for automatic flattery detection. In particular, we employ pretrained AST, Wav2Vec2, and Whisper models for the speech modality, and Whisper ASR models combined with a RoBERTa text classifier for the textual modality. Subsequently, we build a multimodal classifier by combining text and audio representations. Evaluation on unseen test data demonstrates promising results, with Unweighted Average Recall scores reaching 82.46% in audio-only experiments, 85.97% in text-only experiments, and 87.16% using a multimodal approach.
KW - Transformers
KW - computational paralinguistics
KW - flattery
KW - human-AI interaction
KW - speech classification
UR - http://www.scopus.com/inward/record.url?scp=85214824710&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2024-87
DO - 10.21437/Interspeech.2024-87
M3 - Conference article
AN - SCOPUS:85214824710
SN - 2308-457X
SP - 3530
EP - 3534
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 25th Interspeech Conference 2024
Y2 - 1 September 2024 through 5 September 2024
ER -