TY - JOUR
T1 - 'Are You Playing a Shooter Again?!' Deep Representation Learning for Audio-Based Video Game Genre Recognition
AU - Amiriparian, Shahin
AU - Cummins, Nicholas
AU - Gerczuk, Maurice
AU - Pugachevskiy, Sergey
AU - Ottl, Sandra
AU - Schuller, Bjorn
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2020/6
Y1 - 2020/6
AB - In this paper, we present a novel computer audition task: audio-based video game genre classification. The aim of this study is threefold: 1) to check the feasibility of the proposed task; 2) to introduce a new corpus, the Game Genre by Audio + Multimodal Extracts (G$^{2}$AME), collected entirely from social multimedia; and 3) to compare the efficacy of various acoustic feature spaces for classifying the G$^{2}$AME corpus into six game genres using a linear support vector machine classifier. For the classification, we extract three different feature representations from the game audio files: 1) knowledge-based acoustic features; 2) Deep Spectrum features; and 3) quantized Deep Spectrum features using Bag-of-Audio-Words. The Deep Spectrum features are a deep-learning-based representation obtained by forwarding visual representations of the audio instances, in particular spectrograms, mel-spectrograms, chromagrams, and their deltas, through deep task-independent pretrained CNNs. Specifically, activations of fully connected layers from three common image classification CNNs (GoogLeNet, AlexNet, and VGG16) are used as feature vectors. Results for the six-genre classification problem indicate the suitability of our deep learning approach for this task. Our best method achieves an unweighted average recall of up to 66.9% using tenfold cross-validation.
KW - Audio classification
KW - convolutional neural network (CNN)
KW - deep learning
KW - game genre classification
UR - http://www.scopus.com/inward/record.url?scp=85087542526&partnerID=8YFLogxK
U2 - 10.1109/TG.2019.2894532
DO - 10.1109/TG.2019.2894532
M3 - Article
AN - SCOPUS:85087542526
SN - 2475-1502
VL - 12
SP - 145
EP - 154
JO - IEEE Transactions on Games
JF - IEEE Transactions on Games
IS - 2
M1 - 8620524
ER -