TY - GEN
T1 - Lbl2Vec
T2 - 17th International Conference on Web Information Systems and Technologies, WEBIST 2021
AU - Schopf, Tim
AU - Braun, Daniel
AU - Matthes, Florian
N1 - Publisher Copyright: © 2021 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved.
PY - 2021
Y1 - 2021
N2 - In this paper, we consider the task of retrieving documents with predefined topics from an unlabeled document dataset using an unsupervised approach. The proposed unsupervised approach requires only a small number of keywords describing the respective topics and no labeled document. Existing approaches either heavily relied on a large amount of additionally encoded world knowledge or on term-document frequencies. Contrariwise, we introduce a method that learns jointly embedded document and word vectors solely from the unlabeled document dataset in order to find documents that are semantically similar to the topics described by the keywords. The proposed method requires almost no text preprocessing but is simultaneously effective at retrieving relevant documents with high probability. When successively retrieving documents on different predefined topics from publicly available and commonly used datasets, we achieved an average area under the receiver operating characteristic curve value of 0.95 on one dataset and 0.92 on another. Further, our method can be used for multiclass document classification, without the need to assign labels to the dataset in advance. Compared with an unsupervised classification baseline, we increased F1 scores from 76.6 to 82.7 and from 61.0 to 75.1 on the respective datasets. For easy replication of our approach, we make the developed Lbl2Vec code publicly available as a ready-to-use tool under the 3-Clause BSD license.
KW - Document Retrieval
KW - Natural Language Processing
KW - Unsupervised Document Classification
UR - http://www.scopus.com/inward/record.url?scp=85132287748&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85132287748
T3 - International Conference on Web Information Systems and Technologies, WEBIST - Proceedings
SP - 124
EP - 132
BT - WEBIST 2021 - Proceedings of the 17th International Conference on Web Information Systems and Technologies
A2 - Mayo, Francisco Dominguez
A2 - Marchiori, Massimo
A2 - Filipe, Joaquim
PB - Science and Technology Publications, Lda
Y2 - 26 October 2021 through 28 October 2021
ER -