TY - GEN
T1 - A paralinguistic approach to speaker diarisation using age, gender, voice likability and personality traits
AU - Zhang, Yue
AU - Weninger, Felix
AU - Liu, Boqing
AU - Schmitt, Maximilian
AU - Eyben, Florian
AU - Schuller, Björn
N1 - Publisher Copyright:
© 2017 Copyright held by the owner/author(s).
PY - 2017/10/23
Y1 - 2017/10/23
N2 - In this work, we present a new view on automatic speaker diarisation, i.e., assessing "who speaks when", based on the recognition of speaker traits such as age, gender, voice likability, and personality. Traditionally, speaker diarisation is accomplished using low-level audio descriptors (e.g., cepstral or spectral features), neglecting the fact that humans can discriminate speakers well by various perceived characteristics. Thus, we advocate a novel paralinguistic approach that combines speaker diarisation with speaker characterisation by automatically identifying speakers according to their individual traits. In a three-tier processing flow, speaker segmentation by voice activity detection (VAD) is first performed to detect speaker turns. Next, speaker attributes are predicted using pre-trained paralinguistic models. Finally, clustering algorithms are applied to the predicted traits to tag the speakers. We evaluate our methods against state-of-the-art open-source and commercial systems on a corpus of realistic, spontaneous dyadic conversations recorded in the wild from three different cultures (Chinese, English, German). Our results provide clear evidence that using paralinguistic features for speaker diarisation is a promising avenue of research.
AB - In this work, we present a new view on automatic speaker diarisation, i.e., assessing "who speaks when", based on the recognition of speaker traits such as age, gender, voice likability, and personality. Traditionally, speaker diarisation is accomplished using low-level audio descriptors (e.g., cepstral or spectral features), neglecting the fact that humans can discriminate speakers well by various perceived characteristics. Thus, we advocate a novel paralinguistic approach that combines speaker diarisation with speaker characterisation by automatically identifying speakers according to their individual traits. In a three-tier processing flow, speaker segmentation by voice activity detection (VAD) is first performed to detect speaker turns. Next, speaker attributes are predicted using pre-trained paralinguistic models. Finally, clustering algorithms are applied to the predicted traits to tag the speakers. We evaluate our methods against state-of-the-art open-source and commercial systems on a corpus of realistic, spontaneous dyadic conversations recorded in the wild from three different cultures (Chinese, English, German). Our results provide clear evidence that using paralinguistic features for speaker diarisation is a promising avenue of research.
KW - Computational paralinguistics
KW - Speaker characteristics
KW - Speaker diarisation
KW - Speaker identification
UR - http://www.scopus.com/inward/record.url?scp=85035195602&partnerID=8YFLogxK
U2 - 10.1145/3123266.3123338
DO - 10.1145/3123266.3123338
M3 - Conference contribution
AN - SCOPUS:85035195602
T3 - MM 2017 - Proceedings of the 2017 ACM Multimedia Conference
SP - 387
EP - 392
BT - MM 2017 - Proceedings of the 2017 ACM Multimedia Conference
PB - Association for Computing Machinery, Inc.
T2 - 25th ACM International Conference on Multimedia, MM 2017
Y2 - 23 October 2017 through 27 October 2017
ER -