TY - GEN
T1 - Deep end-to-end representation learning for food type recognition from speech
AU - Sertolli, Benjamin
AU - Sengur, Abdulkadir
AU - Cummins, Nicholas
AU - Schuller, Björn W.
N1 - Publisher Copyright: © 2018 Association for Computing Machinery.
PY - 2018/10/2
Y1 - 2018/10/2
N2 - The use of a Convolutional Neural Network (CNN) pre-trained for one task as a feature extractor for another task is standard practice in many image classification paradigms. However, to date, comparatively few works have explored this technique for speech classification tasks. Herein, we utilise a pre-trained end-to-end Automatic Speech Recognition CNN as a feature extractor for the task of food-type recognition from speech. Furthermore, we explore the benefits of Compact Bilinear Pooling for combining multiple feature representations extracted from the CNN. The key results presented indicate the suitability of this approach. When combined with a Recurrent Neural Network classifier, our strongest system achieves, for a seven-class food-type classification task, an unweighted average recall of 73.3% on the test set of the iHEARu-EAT database.
AB - The use of a Convolutional Neural Network (CNN) pre-trained for one task as a feature extractor for another task is standard practice in many image classification paradigms. However, to date, comparatively few works have explored this technique for speech classification tasks. Herein, we utilise a pre-trained end-to-end Automatic Speech Recognition CNN as a feature extractor for the task of food-type recognition from speech. Furthermore, we explore the benefits of Compact Bilinear Pooling for combining multiple feature representations extracted from the CNN. The key results presented indicate the suitability of this approach. When combined with a Recurrent Neural Network classifier, our strongest system achieves, for a seven-class food-type classification task, an unweighted average recall of 73.3% on the test set of the iHEARu-EAT database.
KW - Compact Bilinear Pooling
KW - Deep Representation Learning
KW - Eating Condition
KW - End-to-End Learning
KW - Recurrent Neural Networks
UR - http://www.scopus.com/inward/record.url?scp=85056613029&partnerID=8YFLogxK
U2 - 10.1145/3242969.3243683
DO - 10.1145/3242969.3243683
M3 - Conference contribution
AN - SCOPUS:85056613029
T3 - ICMI 2018 - Proceedings of the 2018 International Conference on Multimodal Interaction
SP - 574
EP - 578
BT - ICMI 2018 - Proceedings of the 2018 International Conference on Multimodal Interaction
PB - Association for Computing Machinery, Inc
T2 - 20th ACM International Conference on Multimodal Interaction, ICMI 2018
Y2 - 16 October 2018 through 20 October 2018
ER -