'Are You Playing a Shooter Again?!' Deep Representation Learning for Audio-Based Video Game Genre Recognition

Shahin Amiriparian, Nicholas Cummins, Maurice Gerczuk, Sergey Pugachevskiy, Sandra Ottl, Bjorn Schuller

Research output: Contribution to journal › Article › peer-review

10 Scopus citations


In this paper, we present a novel computer audition task: audio-based video game genre classification. The study has three aims: 1) to check the feasibility of the proposed task; 2) to introduce a new corpus, The Game Genre by Audio + Multimodal Extracts (G$^{2}$AME), collected entirely from social multimedia; and 3) to compare the efficacy of various acoustic feature spaces for classifying the G$^{2}$AME corpus into six game genres using a linear support vector machine classifier. For the classification, we extract three different feature representations from the game audio files: 1) knowledge-based acoustic features; 2) Deep Spectrum features; and 3) quantized Deep Spectrum features using Bag-of-Audio-Words. The Deep Spectrum features are a deep-learning-based representation obtained by forwarding visual representations of the audio instances (in particular spectrograms, mel-spectrograms, chromagrams, and their deltas) through deep, task-independent, pretrained CNNs. Specifically, activations of fully connected layers from three common image classification CNNs (GoogLeNet, AlexNet, and VGG16) are used as feature vectors. Results for the six-genre classification problem indicate the suitability of our deep learning approach for this task. Our best method achieves an unweighted average recall of up to 66.9% under tenfold cross-validation.
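The abstract reports its headline result as unweighted average recall (UAR), which is the mean of the per-class recalls, so each genre counts equally regardless of how many recordings it has. The following is a minimal sketch of that metric in plain Python. It is an illustration only and not the authors' evaluation code; the function name `unweighted_average_recall` and the toy labels are assumptions and are not taken from the G$^{2}$AME corpus.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1

    if not total:
        raise ValueError("no samples given")

    # Recall is computed per class first, then averaged without class weights.
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical, deliberately imbalanced toy labels (not from the paper).
y_true = ["shooter"] * 8 + ["racing"] * 2
y_pred = ["shooter"] * 8 + ["shooter", "racing"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR: {uar:.3f}, accuracy: {accuracy:.3f}")
```

On imbalanced data such as this toy example, UAR penalizes errors on the smaller class more heavily than plain accuracy does, which is a common reason for preferring it when class sizes differ.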

Original language: English
Article number: 8620524
Pages (from-to): 145-154
Number of pages: 10
Journal: IEEE Transactions on Games
Issue number: 2
State: Published - Jun 2020
Externally published: Yes


  • Audio classification
  • convolutional neural network (CNN)
  • deep learning
  • game genre classification

