TY - GEN
T1 - Data Scarcity
T2 - 10th International Conference on Pattern Recognition Applications and Methods, ICPRAM 2021
AU - Glaser, Ingo
AU - Sadegharmaki, Shabnam
AU - Komboz, Basil
AU - Matthes, Florian
N1 - Publisher Copyright: © 2021 by SCITEPRESS - Science and Technology Publications, Lda. All rights reserved.
PY - 2021
Y1 - 2021
N2 - Legal document analysis is an important research area. The classification of clauses or sentences enables valuable insights such as the extraction of rights and obligations. However, datasets consisting of contracts or other legal documents are quite rare, particularly for the German language. The exorbitant cost of manually labeled data, especially for text classification, motivates many studies that suggest alternative methods to overcome the lack of labeled data. This paper experimentally investigates the effects of text data augmentation on the quality of classification tasks. While a large number of techniques exist, this work examines a selected subset including semi-supervised learning methods and thesaurus-based data augmentation. We show not only that thesaurus-based data augmentation as well as text augmentation with synonyms and hypernyms can improve the classification results, but also that the effect of such methods depends on the underlying data structure.
AB - Legal document analysis is an important research area. The classification of clauses or sentences enables valuable insights such as the extraction of rights and obligations. However, datasets consisting of contracts or other legal documents are quite rare, particularly for the German language. The exorbitant cost of manually labeled data, especially for text classification, motivates many studies that suggest alternative methods to overcome the lack of labeled data. This paper experimentally investigates the effects of text data augmentation on the quality of classification tasks. While a large number of techniques exist, this work examines a selected subset including semi-supervised learning methods and thesaurus-based data augmentation. We show not only that thesaurus-based data augmentation as well as text augmentation with synonyms and hypernyms can improve the classification results, but also that the effect of such methods depends on the underlying data structure.
KW - Data Scarcity
KW - Legal Text Analytics
KW - Natural Language Processing
KW - Text Classification
UR - http://www.scopus.com/inward/record.url?scp=85174603086&partnerID=8YFLogxK
U2 - 10.5220/0010268005560564
DO - 10.5220/0010268005560564
M3 - Conference contribution
AN - SCOPUS:85174603086
SN - 9789897584862
T3 - International Conference on Pattern Recognition Applications and Methods
SP - 556
EP - 564
BT - ICPRAM 2021 - Proceedings of the 10th International Conference on Pattern Recognition Applications and Methods, Volume 1
A2 - De Marsico, Maria
A2 - Sanniti di Baja, Gabriella
A2 - Fred, Ana L.N.
PB - Science and Technology Publications, Lda
Y2 - 4 February 2021 through 6 February 2021
ER -