Data Scarcity: Methods to Improve the Quality of Text Classification

Ingo Glaser, Shabnam Sadegharmaki, Basil Komboz, Florian Matthes

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Legal document analysis is an important research area. Classifying clauses or sentences enables valuable insights, such as the extraction of rights and obligations. However, datasets consisting of contracts or other legal documents are rare, particularly for the German language. The exorbitant cost of manually labeled data, especially for text classification, motivates many studies that propose alternative methods to overcome the lack of labeled data. This paper investigates the effects of text data augmentation on the quality of classification tasks. While a large number of techniques exist, this work examines a selected subset, including semi-supervised learning methods and thesaurus-based data augmentation. We show not only that thesaurus-based data augmentation, as well as text augmentation with synonyms and hypernyms, can improve classification results, but also that the effect of such methods depends on the underlying data structure.
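
As a rough illustration of what thesaurus-based augmentation can look like, the following minimal Python sketch swaps known words for a synonym and appends the result to the training set under the original label. It is not the authors' implementation: the `TOY_THESAURUS` dictionary, the function names `augment` and `augment_dataset`, the English example sentence, and the replacement probability `p` are all assumptions made for this example, and a real setup would draw synonyms or hypernyms from a proper lexical resource.

```python
import random

# Hypothetical toy thesaurus (an assumption for illustration only);
# a real pipeline would load synonyms or hypernyms from a lexical resource.
TOY_THESAURUS = {
    "contract": ["agreement"],
    "buyer": ["purchaser"],
    "pay": ["remit"],
}


def augment(sentence, thesaurus, rng, p=1.0):
    """Return a copy of `sentence` in which each word found in the
    thesaurus is replaced by a randomly chosen entry with probability p."""
    out = []
    for token in sentence.split():
        candidates = thesaurus.get(token.lower())
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates))
        else:
            out.append(token)
    return " ".join(out)


def augment_dataset(samples, thesaurus, seed=0, p=1.0):
    """Return the original (text, label) pairs plus one augmented copy
    of every sample whose text was actually changed."""
    rng = random.Random(seed)
    augmented = list(samples)
    for text, label in samples:
        new_text = augment(text, thesaurus, rng, p)
        if new_text != text:
            # The label is carried over unchanged from the original sample.
            augmented.append((new_text, label))
    return augmented


if __name__ == "__main__":
    data = [("The buyer shall pay", "obligation"), ("No match here", "other")]
    for text, label in augment_dataset(data, TOY_THESAURUS):
        print(label, "|", text)
```

Carrying the label over assumes that a synonym swap does not change the class of a clause, which is a simplification that would need checking on real legal text.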

Original language: English
Title of host publication: ICPRAM 2021 - Proceedings of the 10th International Conference on Pattern Recognition Applications and Methods, Volume 1
Editors: Maria De Marsico, Gabriella Sanniti di Baja, Ana L.N. Fred
Publisher: Science and Technology Publications, Lda
Pages: 556-564
Number of pages: 9
ISBN (Print): 9789897584862
State: Published - 2021
Event: 10th International Conference on Pattern Recognition Applications and Methods, ICPRAM 2021 - Virtual, Online
Duration: 4 Feb 2021 to 6 Feb 2021

Publication series

Name: International Conference on Pattern Recognition Applications and Methods
Volume: 1
ISSN (Electronic): 2184-4313

Conference

Conference: 10th International Conference on Pattern Recognition Applications and Methods, ICPRAM 2021
City: Virtual, Online
Period: 4/02/21 to 6/02/21

Keywords

  • Data Scarcity
  • Legal Text Analytics
  • Natural Language Processing
  • Text Classification
