TY - GEN
T1 - Towards an Efficient Deep Learning Model for Emotion and Theme Recognition in Music
AU - Rajamani, Srividya Tirunellai
AU - Rajamani, Kumar
AU - Schuller, Björn W.
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
AB - Emotion and theme recognition in music plays a vital role in music information retrieval and recommendation systems. Deep learning-based techniques have shown great promise in this regard. Realising optimal network configurations with the smallest number of floating point operations (FLOPs) and model parameters is of paramount importance for obtaining efficient, deployable models, especially on resource-constrained hardware. We propose a novel integration of stand-alone self-attention into a Visual Geometry Group (VGG)-like network for the task of multi-label emotion and theme recognition in music. Through extensive experimental evaluation, we identify the optimal integration of stand-alone self-attention, which leads to a substantial reduction in the number of parameters and FLOPs while yielding better performance. We benchmark our results on the autotagging-moodtheme subset of the MTG-Jamendo dataset. Using mel-spectrograms as input, we demonstrate that our proposed SA-VGG network requires 55% fewer parameters and 60% fewer FLOPs while improving on the baseline ROC-AUC and PR-AUC.
KW - VGG
KW - automatic music tagging
KW - multi-label classification
KW - music emotion recognition
KW - self-attention
UR - http://www.scopus.com/inward/record.url?scp=85127478668&partnerID=8YFLogxK
U2 - 10.1109/MMSP53017.2021.9733532
DO - 10.1109/MMSP53017.2021.9733532
M3 - Conference contribution
AN - SCOPUS:85127478668
T3 - IEEE 23rd International Workshop on Multimedia Signal Processing, MMSP 2021
BT - IEEE 23rd International Workshop on Multimedia Signal Processing, MMSP 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 23rd IEEE International Workshop on Multimedia Signal Processing, MMSP 2021
Y2 - 6 October 2021 through 8 October 2021
ER -