TY - GEN
T1 - Audio-based eating analysis and tracking utilising deep spectrum features
AU - Amiriparian, Shahin
AU - Ottl, Sandra
AU - Gerczuk, Maurice
AU - Pugachevskiy, Sergey
AU - Schuller, Björn
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/11
Y1 - 2019/11
AB - This paper proposes a deep learning system for audio-based eating analysis on the ICMI 2018 Eating Analysis and Tracking (EAT) challenge corpus. We utilise Deep Spectrum features, which are image classification convolutional neural network (CNN) descriptors. We extract the Deep Spectrum features by forwarding Mel-spectrograms of the input audio through deep task-independent pre-trained CNNs, including AlexNet and VGG16. We then use the activations of the first (fc6), second (fc7), and third (fc8) fully connected layers of these networks as feature vectors. We obtain the best classification result by using the first fully connected layer (fc6) of AlexNet to extract features from Mel-spectrograms computed with a window size of 160 ms, a hop size of 80 ms, and the viridis colour map. Finally, we build Bag-of-Deep-Features (BoDF), a quantisation of the Deep Spectrum features. In comparison to the best baseline results on the test partitions of the Food Type and the Likability sub-challenges, unweighted average recall is increased from 67.2 percent to 79.9 percent and from 54.2 percent to 56.1 percent, respectively. For the test partition of the Difficulty sub-challenge, the concordance correlation coefficient is increased from .506 to .509.
KW - Audio processing
KW - Deep Spectrum features
KW - Eating analysis
KW - Pre-trained convolutional neural networks
UR - http://www.scopus.com/inward/record.url?scp=85079354466&partnerID=8YFLogxK
U2 - 10.1109/EHB47216.2019.8970058
DO - 10.1109/EHB47216.2019.8970058
M3 - Conference contribution
AN - SCOPUS:85079354466
T3 - 2019 7th E-Health and Bioengineering Conference, EHB 2019
BT - 2019 7th E-Health and Bioengineering Conference, EHB 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 7th IEEE International Conference on E-Health and Bioengineering, EHB 2019
Y2 - 21 November 2019 through 23 November 2019
ER -