Abstract
Recent techniques for text processing, such as Latent Dirichlet Allocation (LDA) and embedding algorithms like Paragraph Vectors (PV), have enabled improved text classification and retrieval methods. Although these methods can be adapted to different text collections, they do not exploit the fixed document structure that is mandatory in many application areas. In this paper, we focus on patent data, which mandates such a fixed structure. We propose a new classification method that represents documents as Fixed Hierarchy Vectors (FHV), reflecting the document's structure. An FHV represents a document on multiple levels, where each level covers the complete document but with a different local context. Furthermore, we sequentialize this representation and classify documents using LSTM-based architectures. Our experiments show that FHVs provide a richer document representation and that sequential classification improves performance when classifying patents into the International Patent Classification (IPC) taxonomy.
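To make the pipeline described in the abstract concrete, the following is a minimal, illustrative sketch: each section of a patent (e.g. title, abstract, claims) is condensed into a vector, the per-section vectors are fed in document order through an LSTM cell, and the final hidden state is mapped to class probabilities (e.g. the eight IPC sections A–H). This is not the authors' implementation; the mean-of-embeddings section vectors, the hand-rolled LSTM cell, and all dimensions are simplifying assumptions chosen to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

def section_vector(token_vectors):
    # Stand-in for one level of a Fixed Hierarchy Vector:
    # summarize a document section by averaging its word embeddings.
    # (Assumption for illustration -- the paper's FHV construction differs.)
    return np.mean(token_vectors, axis=0)

def lstm_step(x, h, c, W, U, b):
    # One step of a standard LSTM cell.
    # Gate order in the stacked weights: input, forget, output, candidate.
    z = W @ x + U @ h + b
    n = h.size
    i = 1.0 / (1.0 + np.exp(-z[:n]))        # input gate
    f = 1.0 / (1.0 + np.exp(-z[n:2 * n]))   # forget gate
    o = 1.0 / (1.0 + np.exp(-z[2 * n:3 * n]))  # output gate
    g = np.tanh(z[3 * n:])                  # candidate cell state
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def classify(section_sequence, W, U, b, W_out):
    # Feed the per-section vectors through the LSTM in document order,
    # then map the final hidden state to class scores via softmax.
    n = W_out.shape[1]
    h, c = np.zeros(n), np.zeros(n)
    for x in section_sequence:
        h, c = lstm_step(x, h, c, W, U, b)
    scores = W_out @ h
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Toy dimensions: 16-dim embeddings, 8-dim hidden state,
# 8 output classes (the IPC sections A-H).
d, n, k = 16, 8, 8
W = rng.normal(scale=0.1, size=(4 * n, d))
U = rng.normal(scale=0.1, size=(4 * n, n))
b = np.zeros(4 * n)
W_out = rng.normal(scale=0.1, size=(k, n))

# A "document" of three sections, each with 20 random word vectors.
doc = [section_vector(rng.normal(size=(20, d))) for _ in range(3)]
probs = classify(doc, W, U, b, W_out)
print(probs.shape)
```

In a trained model the weights would of course be learned from labeled patents rather than sampled randomly; the sketch only shows how a sequentialized, structure-aware representation flows through an LSTM classifier.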
Original language | English
---|---
Pages | 495-503
Number of pages | 9
DOIs |
State | Published - 2018
Event | 2018 SIAM International Conference on Data Mining, SDM 2018 - San Diego, United States. Duration: 3 May 2018 → 5 May 2018
Conference
Conference | 2018 SIAM International Conference on Data Mining, SDM 2018
---|---
Country/Territory | United States
City | San Diego
Period | 3/05/18 → 5/05/18
Keywords
- LSTM
- Patent classification
- Word embedding
- Word2vec