TY - JOUR
T1 - Practical planning and execution of groupjoin and nested aggregates
AU - Fent, Philipp
AU - Birler, Altan
AU - Neumann, Thomas
N1 - Publisher Copyright:
© 2022, The Author(s).
PY - 2023/11
Y1 - 2023/11
N2 - Groupjoins combine execution of a join and a subsequent group-by. They are common in analytical queries and occur in about [InlineEquation not available: see fulltext.] of the queries in TPC-H and TPC-DS. While they were originally invented to improve performance, efficient parallel execution of groupjoins can be limited by contention in many-core systems. Efficient implementations of groupjoins are highly desirable, as groupjoins are not only used to fuse group-by and join, but are also useful to efficiently execute nested aggregates. For these, the query optimizer needs to reason over the result of aggregation to optimally schedule it. Traditional systems quickly reach their limits of selectivity and cardinality estimations over computed columns and often treat group-by as an optimization barrier. In this paper, we present techniques to efficiently estimate, plan, and execute groupjoins and nested aggregates. We propose four novel techniques, aggregate estimates to predict the result distributions of aggregates, parallel groupjoin execution for scalable execution of groupjoins, index groupjoins, and a greedy eager aggregation optimization technique that introduces nested preaggregations to significantly improve execution plans. The resulting system has improved estimates, better execution plans, and a contention-free evaluation of groupjoins, which speeds up TPC-H and TPC-DS queries significantly.
AB - Groupjoins combine execution of a join and a subsequent group-by. They are common in analytical queries and occur in about [InlineEquation not available: see fulltext.] of the queries in TPC-H and TPC-DS. While they were originally invented to improve performance, efficient parallel execution of groupjoins can be limited by contention in many-core systems. Efficient implementations of groupjoins are highly desirable, as groupjoins are not only used to fuse group-by and join, but are also useful to efficiently execute nested aggregates. For these, the query optimizer needs to reason over the result of aggregation to optimally schedule it. Traditional systems quickly reach their limits of selectivity and cardinality estimations over computed columns and often treat group-by as an optimization barrier. In this paper, we present techniques to efficiently estimate, plan, and execute groupjoins and nested aggregates. We propose four novel techniques, aggregate estimates to predict the result distributions of aggregates, parallel groupjoin execution for scalable execution of groupjoins, index groupjoins, and a greedy eager aggregation optimization technique that introduces nested preaggregations to significantly improve execution plans. The resulting system has improved estimates, better execution plans, and a contention-free evaluation of groupjoins, which speeds up TPC-H and TPC-DS queries significantly.
KW - Parallel processing
KW - Query optimization
KW - Query processing
UR - http://www.scopus.com/inward/record.url?scp=85140448374&partnerID=8YFLogxK
U2 - 10.1007/s00778-022-00765-x
DO - 10.1007/s00778-022-00765-x
M3 - Article
AN - SCOPUS:85140448374
SN - 1066-8888
VL - 32
SP - 1165
EP - 1190
JO - VLDB Journal
JF - VLDB Journal
IS - 6
ER -