TY - JOUR
T1 - A practical approach to groupjoin and nested aggregates
AU - Fent, Philipp
AU - Neumann, Thomas
N1 - Publisher Copyright:
© 2021, VLDB Endowment. All rights reserved.
PY - 2021
Y1 - 2021
N2 - Groupjoins, the combined execution of a join and a subsequent group by, are common in analytical queries, and occur in about 1/8 of the queries in TPC-H and TPC-DS. While they were originally invented to improve performance, efficient parallel execution of groupjoins can be limited by contention, which limits their usefulness in a many-core system. Having an efficient implementation of groupjoins is highly desirable, as groupjoins are not only used to fuse group by and join but are also introduced by the unnesting component of the query optimizer to avoid nested-loops evaluation of aggregates. Furthermore, the query optimizer needs be able to reason over the result of aggregation in order to schedule it correctly. Traditional selectivity and cardinality estimations quickly reach their limits when faced with computed columns from nested aggregates, which leads to poor cost estimations and thus, suboptimal query plans. In this paper, we present techniques to efficiently estimate, plan, and execute groupjoins and nested aggregates. We propose two novel techniques, aggregate estimates to predict the result distribution of aggregates, and parallel groupjoin execution for a scalable execution of groupjoins. The resulting system has significantly better estimates and a contention-free evaluation of groupjoins, which can speed up some TPC-H queries up to a factor of 2.
AB - Groupjoins, the combined execution of a join and a subsequent group by, are common in analytical queries, and occur in about 1/8 of the queries in TPC-H and TPC-DS. While they were originally invented to improve performance, efficient parallel execution of groupjoins can be limited by contention, which limits their usefulness in a many-core system. Having an efficient implementation of groupjoins is highly desirable, as groupjoins are not only used to fuse group by and join but are also introduced by the unnesting component of the query optimizer to avoid nested-loops evaluation of aggregates. Furthermore, the query optimizer needs be able to reason over the result of aggregation in order to schedule it correctly. Traditional selectivity and cardinality estimations quickly reach their limits when faced with computed columns from nested aggregates, which leads to poor cost estimations and thus, suboptimal query plans. In this paper, we present techniques to efficiently estimate, plan, and execute groupjoins and nested aggregates. We propose two novel techniques, aggregate estimates to predict the result distribution of aggregates, and parallel groupjoin execution for a scalable execution of groupjoins. The resulting system has significantly better estimates and a contention-free evaluation of groupjoins, which can speed up some TPC-H queries up to a factor of 2.
UR - http://www.scopus.com/inward/record.url?scp=85119678620&partnerID=8YFLogxK
U2 - 10.14778/3476249.3476288
DO - 10.14778/3476249.3476288
M3 - Conference article
AN - SCOPUS:85119678620
SN - 2150-8097
VL - 14
SP - 2383
EP - 2396
JO - Proceedings of the VLDB Endowment
JF - Proceedings of the VLDB Endowment
IS - 11
T2 - 47th International Conference on Very Large Data Bases, VLDB 2021
Y2 - 16 August 2021 through 20 August 2021
ER -