TY - JOUR
T1 - Request and complaint recognition in call-center speech using a pointwise-convolution recurrent network
AU - Yin, Zhipeng
AU - Xu, Xinzhou
AU - Schuller, Björn
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.
PY - 2025
Y1 - 2025
AB - The task of request and complaint recognition in call-center speech aims to identify both the intention and the emotional state of call speakers by analysing the paralinguistic information conveyed by spoken signals. However, existing works do not fully exploit the fusion of the multi-layer representations derived from foundation models, and they usually provide insufficient sequential encoding of the temporal information in speech. To recognise requests and complaints in call-center speech, this paper proposes an approach based on a PointWise-Convolution Recurrent network (PWCR). Within this approach, we first propose a pointwise-convolution module that performs layer-wise aggregation of the representations from the multiple Transformer layers of a pre-trained foundation model. A recurrent module, consisting of a recurrent layer with multi-head self-attention, is then employed to effectively capture temporal and contextual information. Experimental results on the HealthCall30 Corpus for request and complaint recognition in call-center speech indicate that the proposed approach achieves better recognition performance than state-of-the-art approaches, with unweighted average recalls of 78.7% (maximum) / 77.3% (average) for the request task and 60.6% (maximum) / 59.8% (average) for the complaint task.
KW - Call-center speech
KW - Complaint recognition
KW - Foundation models
KW - Pointwise-convolution recurrent network
KW - Request recognition
UR - http://www.scopus.com/inward/record.url?scp=85217683214&partnerID=8YFLogxK
U2 - 10.1007/s10772-025-10171-7
DO - 10.1007/s10772-025-10171-7
M3 - Article
AN - SCOPUS:85217683214
SN - 1381-2416
JO - International Journal of Speech Technology
JF - International Journal of Speech Technology
ER -