On-the-fly token similarity joins in relational databases

Nikolaus Augsten, Armando Miraglia, Thomas Neumann, Alfons Kemper

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

7 Scopus citations

Abstract

Token similarity joins represent data items as sets of tokens, for example, strings are represented as sets of q-grams (substrings of length q). Two items are considered similar and match if their token sets have a large overlap. Previous work on similarity joins in databases mainly focuses on expressing the overlap computation with relational operators. The tokens are assumed to preexist in the database, and the token generation cannot be expressed as part of the query. Our goal is to efficiently compute token similarity joins on-the-fly, i.e., without any precomputed tokens or indexes. We define tokenize, a new relational operator that generates tokens and allows the similarity join to be fully integrated into relational databases. This allows us to (1) optimize the token generation as part of the query plan, (2) provide the query optimizer with cardinality estimates for tokens, (3) choose efficient algorithms based on the query context. We discuss algebraic properties, cardinality estimates, and an efficient iterator algorithm for tokenize. We implemented our operator in the kernel of PostgreSQL and empirically evaluated its performance for similarity joins.

Original languageEnglish
Title of host publicationSIGMOD 2014 - Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
PublisherAssociation for Computing Machinery
Pages1495-1506
Number of pages12
ISBN (Print)9781450323765
DOIs
StatePublished - 2014
Event2014 ACM SIGMOD International Conference on Management of Data, SIGMOD 2014 - Snowbird, UT, United States
Duration: 22 Jun 201427 Jun 2014

Publication series

NameProceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print)0730-8078

Conference

Conference2014 ACM SIGMOD International Conference on Management of Data, SIGMOD 2014
Country/TerritoryUnited States
CitySnowbird, UT
Period22/06/1427/06/14

Fingerprint

Dive into the research topics of 'On-the-fly token similarity joins in relational databases'. Together they form a unique fingerprint.

Cite this