TY - GEN
T1 - Lecture Video Highlights Detection from Speech
AU - Song, Meishu
AU - Aslan, Ilhan
AU - Parada-Cabaleiro, Emilia
AU - Yang, Zijiang
AU - André, Elisabeth
AU - Yamamoto, Yoshiharu
AU - Schuller, Björn W.
N1 - Publisher Copyright: © 2024 European Signal Processing Conference, EUSIPCO. All rights reserved.
PY - 2024
Y1 - 2024
N2 - In interpersonal teaching, whether co-located or online, lecturers highlight words and sentences in their speech to implicitly communicate that particular content is important. This social behaviour, aimed at capturing students’ attention, becomes crucial in distance learning, where the teacher’s voice is an essential instrument for maximising that attention. To enable intelligent systems, such as smart tutors, to understand and replicate this social behaviour, the ability to automatically recognise speech-based highlighting is needed. To this end, we introduce a public corpus for the automatic detection of speech-based highlighting in learning contexts. By “highlighting” we refer to emphasised content, i.e., the important content that lecturers try to emphasise (highlight) by attracting the listeners’ attention. The dataset is derived from YouTube tutorial videos featuring 104 different English speakers who cover different disciplines, and it will be made freely available to the community. In addition, to provide an initial analysis of the corpus, we report on a series of experiments, with the best results achieved by a combination of a VGG net and transformer architectures. Our initial results of 78.2% accuracy and 78.8% Unweighted Average Recall (UAR) encourage us to believe that this new dataset will facilitate progress in speech processing research for education.
AB - In interpersonal teaching, whether co-located or online, lecturers highlight words and sentences in their speech to implicitly communicate that particular content is important. This social behaviour, aimed at capturing students’ attention, becomes crucial in distance learning, where the teacher’s voice is an essential instrument for maximising that attention. To enable intelligent systems, such as smart tutors, to understand and replicate this social behaviour, the ability to automatically recognise speech-based highlighting is needed. To this end, we introduce a public corpus for the automatic detection of speech-based highlighting in learning contexts. By “highlighting” we refer to emphasised content, i.e., the important content that lecturers try to emphasise (highlight) by attracting the listeners’ attention. The dataset is derived from YouTube tutorial videos featuring 104 different English speakers who cover different disciplines, and it will be made freely available to the community. In addition, to provide an initial analysis of the corpus, we report on a series of experiments, with the best results achieved by a combination of a VGG net and transformer architectures. Our initial results of 78.2% accuracy and 78.8% Unweighted Average Recall (UAR) encourage us to believe that this new dataset will facilitate progress in speech processing research for education.
KW - Highlighting Content
KW - Speech
KW - Transformer
KW - VGG
UR - https://www.scopus.com/pages/publications/85208443807
U2 - 10.23919/eusipco63174.2024.10715058
DO - 10.23919/eusipco63174.2024.10715058
M3 - Conference contribution
AN - SCOPUS:85208443807
T3 - European Signal Processing Conference
SP - 361
EP - 365
BT - 32nd European Signal Processing Conference, EUSIPCO 2024 - Proceedings
PB - European Signal Processing Conference, EUSIPCO
T2 - 32nd European Signal Processing Conference, EUSIPCO 2024
Y2 - 26 August 2024 through 30 August 2024
ER -