To Partition, or Not to Partition, That is the Join Question in a Real System

Maximilian Bandle, Jana Giceva, Thomas Neumann

Research output: Contribution to journalConference articlepeer-review

27 Scopus citations

Abstract

An efficient implementation of a hash join has been a highly researched problem for decades. Recently, the radix join has been shown to have superior performance over the alternatives (e.g., the non-partitioned hash join), albeit on synthetic microbenchmarks. Therefore, it is unclear whether one can simply replace the hash join in an RDBMS or use the radix join as a performance booster for selected queries. If the latter, it is still unknown when one should rely on the radix join to improve performance. In this paper, we address these questions, show how to integrate the radix join in Umbra, a code-generating DBMS, and make it competitive for selective queries by introducing a Bloom-filter based semi-join reducer. We have evaluated how well it runs when used in queries from more representative workloads like TPC-H. Surprisingly, the radix join brings a noticeable improvement in only one out of all 59 joins in TPC-H. Thus, with an extensive range of microbenchmarks, we have isolated the effects of the most important workload factors and synthesized the range of values where partitioning the data for the radix join pays off. Our analysis shows that the benefit of data partitioning quickly diminishes as soon as we deviate from the optimal parameters, and even late materialization rarely helps in real workloads. We thus, conclude that integrating the radix join within a code-generating database rarely justifies the increase in code and optimizer complexity and advise against it for processing real-world workloads.

Original languageEnglish
Pages (from-to)168-180
Number of pages13
JournalProceedings of the ACM SIGMOD International Conference on Management of Data
DOIs
StatePublished - 2021
Event2021 International Conference on Management of Data, SIGMOD 2021 - Virtual, Online, China
Duration: 20 Jun 202125 Jun 2021

Keywords

  • in-memory databases
  • join processing
  • modern hardware
  • partitioning
  • performance evaluation

Fingerprint

Dive into the research topics of 'To Partition, or Not to Partition, That is the Join Question in a Real System'. Together they form a unique fingerprint.

Cite this