TY - GEN
T1 - VCMNet: Weakly Supervised Learning for Automatic Infant Vocalisation Maturity Analysis
T2 - 21st ACM International Conference on Multimodal Interaction, ICMI 2019
AU - Al Futaisi, Najla D.
AU - Zhang, Zixing
AU - Cristia, Alejandrina
AU - Warlaumont, Anne S.
AU - Schuller, Björn W.
N1 - Publisher Copyright:
© 2019 Association for Computing Machinery.
PY - 2019/10/14
Y1 - 2019/10/14
AB - Using neural networks to classify infant vocalisations into important subclasses (such as crying versus speech) is an emergent task in speech technology. One of the biggest roadblocks standing in the way of progress lies in the datasets: The performance of a learning model is affected by the labelling quality and size of the dataset used, and infant vocalisation datasets with good quality labels tend to be small. In this paper, we assess the performance of three models for infant VoCalisation Maturity (VCM) trained with a large dataset annotated automatically using a purpose-built classifier and a small dataset annotated by highly trained human coders. The two datasets are used in three different training strategies, whose performance is compared against a baseline model. The first training strategy investigates adversarial training, while the second exploits multi-task learning as the neural network trains on both datasets simultaneously. In the final strategy, we integrate adversarial training and multi-task learning. All of the training strategies outperform the baseline, with the adversarial training strategy yielding the best results on the development set.
KW - Infant vocalisation
KW - Prelinguistic analysis
KW - Weakly supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85074903889&partnerID=8YFLogxK
U2 - 10.1145/3340555.3353751
DO - 10.1145/3340555.3353751
M3 - Conference contribution
AN - SCOPUS:85074903889
T3 - ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction
SP - 205
EP - 209
BT - ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction
A2 - Gao, Wen
A2 - Meng, Helen Mei-Ling
A2 - Turk, Matthew
A2 - Fussell, Susan R.
A2 - Schuller, Björn
A2 - Song, Yale
A2 - Yu, Kai
PB - Association for Computing Machinery, Inc
Y2 - 14 October 2019 through 18 October 2019
ER -