Deep end-to-end representation learning for food type recognition from speech

Benjamin Sertolli, Abdulkadir Sengur, Nicholas Cummins, Björn W. Schuller

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Scopus citations

Abstract

Using Convolutional Neural Networks (CNNs) pre-trained for one task as feature extractors for another is standard practice in many image classification paradigms. However, to date comparatively few works have explored this technique for speech classification tasks. Herein, we utilise a pre-trained end-to-end Automatic Speech Recognition CNN as a feature extractor for the task of food-type recognition from speech. Furthermore, we explore the benefits of Compact Bilinear Pooling for combining multiple feature representations extracted from the CNN. Key results presented indicate the suitability of this approach. When combined with a Recurrent Neural Network classifier, our strongest system achieves an unweighted average recall of 73.3% on the test set of the iHEARu-EAT database for a seven-class food-type classification task.
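
The unweighted average recall (UAR) reported in the abstract is the mean of the per-class recalls, so each class counts equally however many samples it has. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code. The function name and the toy labels are hypothetical and are not taken from the iHEARu-EAT data.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls (each class weighted equally)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    if not total:
        raise ValueError("no samples given")
    # Recall of class c is correct[c] / total[c]; average over observed classes.
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical, imbalanced toy labels (illustration only).
y_true = ["class_a"] * 8 + ["class_b"] * 2
y_pred = ["class_a"] * 8 + ["class_a", "class_b"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.3f}, accuracy = {accuracy:.3f}")
```

In this toy case plain accuracy is inflated by the majority class, whereas UAR penalises the poorly recognised minority class. For a seven-class task, a classifier that always predicts the same class scores a UAR of 1/7, roughly 14.3%.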

Original language: English
Title of host publication: ICMI 2018 - Proceedings of the 2018 International Conference on Multimodal Interaction
Publisher: Association for Computing Machinery, Inc
Pages: 574-578
Number of pages: 5
ISBN (Electronic): 9781450356923
State: Published - 2 Oct 2018
Externally published: Yes
Event: 20th ACM International Conference on Multimodal Interaction, ICMI 2018 - Boulder, United States
Duration: 16 Oct 2018 to 20 Oct 2018

Publication series

Name: ICMI 2018 - Proceedings of the 2018 International Conference on Multimodal Interaction

Conference

Conference: 20th ACM International Conference on Multimodal Interaction, ICMI 2018
Country/Territory: United States
City: Boulder
Period: 16/10/18 to 20/10/18

Keywords

  • Compact Bilinear Pooling
  • Deep Representation Learning
  • Eating Condition
  • End-to-End Learning
  • Recurrent Neural Networks
