Every row counts: Combining sketches and sampling for accurate group-by result estimates

Michael Freitag, Thomas Neumann

Publikation: KonferenzbeitragPapierBegutachtung

16 Zitate (Scopus)

Abstract

Database systems heavily rely upon cardinality estimates for finding efficient execution plans, and estimation errors can easily affect query execution times by large factors. One particularly difficult problem is estimating the result size of a group-by operator, or, in general, the number of distinct combinations of a set of attributes. In contrast to, e. g., estimating the selectivity of simple filter predicates, the resulting number of groups cannot be predicted reliably without examining the complete input. As a consequence, most existing systems have poor estimates for the number of distinct groups. However, scanning entire relations at optimization time is not feasible in practice. Also, precise group counts cannot be precomputed for every possible combination of attributes. For practical purposes, a cheap mechanism is thus required which can handle arbitrary attribute combinations efficiently and with high accuracy. In this work, we present a novel estimation framework that combines sketched full information over individual columns with random sampling to correct for correlation bias between attributes. This combination can estimate group counts for individual columns nearly perfectly, and for arbitrary column combinations with high accuracy. Extensive experiments show that these excellent results hold for both synthetic and real-world data sets. We demonstrate how this mechanism can be integrated into existing systems with low overhead, and how estimation time can be kept negligible by means of an efficient algorithm for sample scans.

OriginalspracheEnglisch
PublikationsstatusVeröffentlicht - 2019
Veranstaltung9th Biennial Conference on Innovative Data Systems Research, CIDR 2019 - Pacific Grove, USA/Vereinigte Staaten
Dauer: 13 Jan. 201916 Jan. 2019

Konferenz

Konferenz9th Biennial Conference on Innovative Data Systems Research, CIDR 2019
Land/GebietUSA/Vereinigte Staaten
OrtPacific Grove
Zeitraum13/01/1916/01/19

Fingerprint

Untersuchen Sie die Forschungsthemen von „Every row counts: Combining sketches and sampling for accurate group-by result estimates“. Zusammen bilden sie einen einzigartigen Fingerprint.

Dieses zitieren