Every row counts: Combining sketches and sampling for accurate group-by result estimates

Michael Freitag, Thomas Neumann

Research output: Contribution to conferencePaperpeer-review

16 Scopus citations

Abstract

Database systems heavily rely upon cardinality estimates for finding efficient execution plans, and estimation errors can easily affect query execution times by large factors. One particularly difficult problem is estimating the result size of a group-by operator, or, in general, the number of distinct combinations of a set of attributes. In contrast to, e. g., estimating the selectivity of simple filter predicates, the resulting number of groups cannot be predicted reliably without examining the complete input. As a consequence, most existing systems have poor estimates for the number of distinct groups. However, scanning entire relations at optimization time is not feasible in practice. Also, precise group counts cannot be precomputed for every possible combination of attributes. For practical purposes, a cheap mechanism is thus required which can handle arbitrary attribute combinations efficiently and with high accuracy. In this work, we present a novel estimation framework that combines sketched full information over individual columns with random sampling to correct for correlation bias between attributes. This combination can estimate group counts for individual columns nearly perfectly, and for arbitrary column combinations with high accuracy. Extensive experiments show that these excellent results hold for both synthetic and real-world data sets. We demonstrate how this mechanism can be integrated into existing systems with low overhead, and how estimation time can be kept negligible by means of an efficient algorithm for sample scans.

Original languageEnglish
StatePublished - 2019
Event9th Biennial Conference on Innovative Data Systems Research, CIDR 2019 - Pacific Grove, United States
Duration: 13 Jan 201916 Jan 2019

Conference

Conference9th Biennial Conference on Innovative Data Systems Research, CIDR 2019
Country/TerritoryUnited States
CityPacific Grove
Period13/01/1916/01/19

Fingerprint

Dive into the research topics of 'Every row counts: Combining sketches and sampling for accurate group-by result estimates'. Together they form a unique fingerprint.

Cite this