Communication-Optimal Parallel Reservoir Sampling

Christian Winter, Moritz Sichert, Altan Birler, Thomas Neumann, Alfons Kemper

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

When evaluating complex analytical queries on high-velocity data streams, many systems cannot run those queries on all elements of a stream. Sampling is a widely used method to reduce the system load by replacing the input with a representative yet manageable subset. For unbounded data, reservoir sampling generates a fixed-size uniform sample independent of the input cardinality. However, the collection of reservoir samples itself can already be a bottleneck for high-velocity data. In this paper, we introduce a technique that allows fully parallelizing reservoir sampling for many-core architectures. Our approach relies on the efficient combination of thread-local samples taken over chunks of the input without necessitating communication during the sampling phase and with minimal communication when merging. We show how our efficient merge guarantees uniform random samples while allowing data to be distributed over worker threads arbitrarily. Our analysis of this approach within the Umbra database system demonstrates linear scaling along the available threads and the ability to sustain high-velocity workloads.

Original languageEnglish
Title of host publicationDatenbanksysteme fur Business, Technologie und Web, BTW 2023
EditorsBirgitta Konig-Ries, Stefanie Scherzinger, Wolfgang Lehner, Gottfried Vossen
PublisherGesellschaft fur Informatik (GI)
Pages567-578
Number of pages12
ISBN (Electronic)9783885797258
DOIs
StatePublished - 2023
Event2023 Datenbanksysteme fur Business, Technologie und Web, BTW 2023 - 2023 Database Systems for Business, Technology and Web, BTW 2023 - Dresden, Germany
Duration: 6 Mar 202310 Mar 2023

Publication series

NameLecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft fur Informatik (GI)
VolumeP-331
ISSN (Print)1617-5468

Conference

Conference2023 Datenbanksysteme fur Business, Technologie und Web, BTW 2023 - 2023 Database Systems for Business, Technology and Web, BTW 2023
Country/TerritoryGermany
CityDresden
Period6/03/2310/03/23

Keywords

  • Parallel Sampling
  • Reservoir Sampling
  • Stream Processing

Fingerprint

Dive into the research topics of 'Communication-Optimal Parallel Reservoir Sampling'. Together they form a unique fingerprint.

Cite this