Augment to prevent: Short-text data augmentation in deep learning for hate-speech classification

Georgios Rizos, Konstantin Hemker, Björn Schuller

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

102 Scopus citations

Abstract

In this paper, we address the issue of augmenting text data in supervised Natural Language Processing problems, exemplified by deep online hate speech classification. A great challenge in this domain is that although the presence of hate speech can be deleterious to the quality of service provided by social platforms, it still comprises only a tiny fraction of the content that can be found online, which can lead to performance deterioration due to majority class overfitting. To this end, we perform a thorough study on the application of deep learning to the hate speech detection problem: a) we propose three text-based data augmentation techniques aimed at reducing the degree of class imbalance and maximising the amount of information we can extract from our limited resources, and b) we apply them to a selection of top-performing deep architectures and hate speech databases in order to showcase their generalisation properties. The data augmentation techniques are a) synonym replacement based on word embedding vector closeness, b) warping of the word tokens along the padded sequence, and c) class-conditional, recurrent neural language generation. Our proposed framework yields a significant increase in multi-class hate speech detection performance, outperforming the baseline on the largest online hate speech database by an absolute 5.7 % increase in Macro-F1 score and 30 % in hate speech class recall.
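The first two augmentation ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy embedding table, the similarity threshold, the `<pad>` token, and all function names are assumptions made for illustration, and "warping" is interpreted here simply as placing the tokens at a random offset inside the padded sequence. The third technique (class-conditional neural generation) is not sketched.

```python
import math
import random

# Toy embedding table (hypothetical values, for illustration only).
EMBEDDINGS = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.85, 0.15, 0.05],
    "bad": [-0.8, 0.2, 0.1],
    "awful": [-0.75, 0.25, 0.05],
    "movie": [0.0, 0.9, 0.3],
}
PAD = "<pad>"  # assumed padding token


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_synonym(word, threshold=0.95):
    """Return the closest other word whose similarity meets the threshold.

    Words outside the table, or without a close enough neighbour,
    are returned unchanged.
    """
    if word not in EMBEDDINGS:
        return word
    best, best_sim = word, threshold
    for candidate, vector in EMBEDDINGS.items():
        if candidate == word:
            continue
        sim = cosine(EMBEDDINGS[word], vector)
        if sim >= best_sim:
            best, best_sim = candidate, sim
    return best


def synonym_replace(tokens, prob, rng):
    """Replace each token by its nearest neighbour with probability prob."""
    return [nearest_synonym(t) if rng.random() < prob else t for t in tokens]


def warp_in_padding(tokens, max_len, rng):
    """Place the tokens at a random offset inside a sequence of length max_len."""
    tokens = tokens[:max_len]
    offset = rng.randint(0, max_len - len(tokens))
    right = max_len - len(tokens) - offset
    return [PAD] * offset + tokens + [PAD] * right


if __name__ == "__main__":
    rng = random.Random(0)
    print(synonym_replace(["a", "good", "movie"], 0.5, rng))
    print(warp_in_padding(["a", "good", "movie"], 6, rng))
```

Both functions leave the label of the sample unchanged, so applying them to minority-class samples is one way such techniques could reduce class imbalance.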

Original language: English
Title of host publication: CIKM 2019 - Proceedings of the 28th ACM International Conference on Information and Knowledge Management
Publisher: Association for Computing Machinery
Pages: 991-1000
Number of pages: 10
ISBN (Electronic): 9781450369763
State: Published - 3 Nov 2019
Externally published: Yes
Event: 28th ACM International Conference on Information and Knowledge Management, CIKM 2019 - Beijing, China
Duration: 3 Nov 2019 to 7 Nov 2019

Publication series

Name: International Conference on Information and Knowledge Management, Proceedings

Conference

Conference: 28th ACM International Conference on Information and Knowledge Management, CIKM 2019
Country/Territory: China
City: Beijing
Period: 3/11/19 to 7/11/19

Keywords

  • Class imbalance
  • Online hate speech detection
  • Short text data augmentation
