TY - GEN
T1 - Large-Scale Nonverbal Vocalization Detection Using Transformers
AU - Tzirakis, Panagiotis
AU - Baird, Alice
AU - Brooks, Jeffrey
AU - Gagne, Christopher
AU - Kim, Lauren
AU - Opara, Michael
AU - Gregory, Christopher
AU - Metrick, Jacob
AU - Boseck, Garrett
AU - Tiruvadi, Vineet
AU - Schuller, Björn
AU - Keltner, Dacher
AU - Cowen, Alan
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Detecting emotionally expressive nonverbal vocalizations is essential to developing technologies that can converse fluently with humans. The affective computing community has largely focused on understanding the intonation of emotional speech and language. However, advances in the study of vocal emotional behavior suggest that emotions may be more readily conveyed not by speech but by nonverbal vocalizations such as laughs, sighs, shrieks, and grunts - vocalizations that often occur in lieu of speech. The task of detecting such emotional vocalizations has been largely overlooked by researchers, likely due to the limited availability of data capturing a sufficiently wide variety of vocalizations. Most studies in the literature focus on detecting laughter or cries. In this paper, we present the first, to the best of our knowledge, nonverbal vocalization detection model trained to detect as many as 67 types of emotional vocalizations. For our purposes, we use the large-scale and in-the-wild HUME-VB dataset that provides more than 156 h of data. We thoroughly investigate the use of pre-trained audio transformer models, such as Wav2Vec2 and Whisper, and provide useful insights for the task at hand using different types of noise signals.
AB - Detecting emotionally expressive nonverbal vocalizations is essential to developing technologies that can converse fluently with humans. The affective computing community has largely focused on understanding the intonation of emotional speech and language. However, advances in the study of vocal emotional behavior suggest that emotions may be more readily conveyed not by speech but by nonverbal vocalizations such as laughs, sighs, shrieks, and grunts - vocalizations that often occur in lieu of speech. The task of detecting such emotional vocalizations has been largely overlooked by researchers, likely due to the limited availability of data capturing a sufficiently wide variety of vocalizations. Most studies in the literature focus on detecting laughter or cries. In this paper, we present the first, to the best of our knowledge, nonverbal vocalization detection model trained to detect as many as 67 types of emotional vocalizations. For our purposes, we use the large-scale and in-the-wild HUME-VB dataset that provides more than 156 h of data. We thoroughly investigate the use of pre-trained audio transformer models, such as Wav2Vec2 and Whisper, and provide useful insights for the task at hand using different types of noise signals.
KW - Nonverbal vocalization
KW - transformers
KW - vocal burst detection
UR - http://www.scopus.com/inward/record.url?scp=85163984944&partnerID=8YFLogxK
U2 - 10.1109/ICASSP49357.2023.10095294
DO - 10.1109/ICASSP49357.2023.10095294
M3 - Conference contribution
AN - SCOPUS:85163984944
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
BT - ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023
Y2 - 4 June 2023 through 10 June 2023
ER -