TY - GEN
T1 - On-the-fly token similarity joins in relational databases
AU - Augsten, Nikolaus
AU - Miraglia, Armando
AU - Neumann, Thomas
AU - Kemper, Alfons
PY - 2014
Y1 - 2014
N2 - Token similarity joins represent data items as sets of tokens, for example, strings are represented as sets of q-grams (substrings of length q). Two items are considered similar and match if their token sets have a large overlap. Previous work on similarity joins in databases mainly focuses on expressing the overlap computation with relational operators. The tokens are assumed to preexist in the database, and the token generation cannot be expressed as part of the query. Our goal is to efficiently compute token similarity joins on-the-fly, i.e., without any precomputed tokens or indexes. We define tokenize, a new relational operator that generates tokens and allows the similarity join to be fully integrated into relational databases. This allows us to (1) optimize the token generation as part of the query plan, (2) provide the query optimizer with cardinality estimates for tokens, (3) choose efficient algorithms based on the query context. We discuss algebraic properties, cardinality estimates, and an efficient iterator algorithm for tokenize. We implemented our operator in the kernel of PostgreSQL and empirically evaluated its performance for similarity joins.
AB - Token similarity joins represent data items as sets of tokens, for example, strings are represented as sets of q-grams (substrings of length q). Two items are considered similar and match if their token sets have a large overlap. Previous work on similarity joins in databases mainly focuses on expressing the overlap computation with relational operators. The tokens are assumed to preexist in the database, and the token generation cannot be expressed as part of the query. Our goal is to efficiently compute token similarity joins on-the-fly, i.e., without any precomputed tokens or indexes. We define tokenize, a new relational operator that generates tokens and allows the similarity join to be fully integrated into relational databases. This allows us to (1) optimize the token generation as part of the query plan, (2) provide the query optimizer with cardinality estimates for tokens, (3) choose efficient algorithms based on the query context. We discuss algebraic properties, cardinality estimates, and an efficient iterator algorithm for tokenize. We implemented our operator in the kernel of PostgreSQL and empirically evaluated its performance for similarity joins.
UR - http://www.scopus.com/inward/record.url?scp=84904364824&partnerID=8YFLogxK
U2 - 10.1145/2588555.2610530
DO - 10.1145/2588555.2610530
M3 - Conference contribution
AN - SCOPUS:84904364824
SN - 9781450323765
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 1495
EP - 1506
BT - SIGMOD 2014 - Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
PB - Association for Computing Machinery
T2 - 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD 2014
Y2 - 22 June 2014 through 27 June 2014
ER -