German abusive language dataset with focus on COVID-19

Maximilian Wich, Svenja Räther, Georg Groh

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

5 Scopus citations

Abstract

The COVID-19 pandemic has had a significant impact on human lives globally. As a result, it is unsurprising that it has influenced hate speech and other kinds of abusive language on social media. Machine learning models have been designed to automatically detect such posts and messages, but they require a significant amount of labeled data. Despite the relevance of COVID-19 to abusive language detection, no annotated datasets with this focus are available. To address this gap, we create such a dataset. Our contributions are as follows: (1) a methodology for collecting abusive language data from Twitter that yields a substantial amount of abusive and hateful content, and (2) a German abusive language dataset with 4,960 annotated tweets centered on COVID-19. Both the methodology and the dataset are intended to aid researchers in improving abusive language detection.

Original language: English
Title of host publication: KONVENS 2021 - Proceedings of the 17th Conference on Natural Language Processing
Publisher: KONVENS
Pages: 247-252
Number of pages: 6
ISBN (Electronic): 9781954085831
State: Published - 2021
Event: 17th Conference on Natural Language Processing, KONVENS 2021 - Dusseldorf, Germany
Duration: 6 Sep 2021 to 9 Sep 2021

Publication series

Name: KONVENS 2021 - Proceedings of the 17th Conference on Natural Language Processing

Conference

Conference: 17th Conference on Natural Language Processing, KONVENS 2021
Country/Territory: Germany
City: Dusseldorf
Period: 6/09/21 to 9/09/21
