ParPaRaw: Massively parallel parsing of delimiter-separated raw data

Elias Stehle, Hans-Arno Jacobsen

Publication: Contribution to journal, Conference article, Peer-reviewed

11 citations (Scopus)

Abstract

Parsing is essential for a wide range of use cases, such as stream processing, bulk loading, and in-situ querying of raw data. Yet, this compute-intensive step often constitutes a major bottleneck in the data ingestion pipeline, since parsing inputs that require more involved parsing rules is challenging to parallelise. This work proposes a massively parallel algorithm for parsing delimiter-separated data formats on GPUs. Unlike the state of the art, the proposed approach does not require an initial sequential pass over the input to determine a thread's parsing context, that is, how a thread beginning somewhere in the middle of the input should interpret a certain symbol (e.g., whether to interpret a comma as a delimiter or as part of a larger string enclosed in double quotes). Instead of tailoring the approach to a single format, we are able to perform a massively parallel finite state machine (FSM) simulation, which is more flexible and powerful, supporting more expressive parsing rules with general applicability. Achieving a parsing rate of as much as 14.2 GB/s, our experimental evaluation on a GPU with 3584 cores shows that the presented approach is able to scale to thousands of cores and beyond. With an end-to-end streaming approach, we are able to exploit the full-duplex capabilities of the PCIe bus and hide latency from data transfers. Considering the end-to-end performance, the algorithm parses 4.8 GB in as little as 0.44 seconds, including data transfers.
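
The parsing-context problem described in the abstract can be illustrated with a small sketch. This is not the authors' GPU implementation. It is a plain Python illustration under stated assumptions: a hypothetical two-state automaton (inside or outside double quotes), hypothetical function names, and an ordinary loop standing in for a parallel prefix scan. The idea shown is that each chunk is simulated once for every possible start state, and the resulting per-chunk transition vectors are combined with an associative composition, so no chunk needs to wait for a sequential pass over its predecessors.

```python
# Hypothetical two-state automaton for comma-separated data with quoted fields.
UNQUOTED, QUOTED = 0, 1
NUM_STATES = 2


def step(state, ch):
    """Transition function: a double quote toggles between the two states."""
    if ch == '"':
        return QUOTED if state == UNQUOTED else UNQUOTED
    return state


def chunk_transition_vector(chunk):
    """Simulate the chunk from every possible start state.

    Entry s of the result is the state reached after the chunk when starting in s.
    """
    vec = list(range(NUM_STATES))
    for ch in chunk:
        vec = [step(s, ch) for s in vec]
    return tuple(vec)


def compose(first, second):
    """Compose two transition vectors (apply `first`, then `second`).

    The operation is associative, which is what makes a parallel scan possible.
    """
    return tuple(second[s] for s in first)


def chunk_start_states(chunks, initial=UNQUOTED):
    """Exclusive scan over the transition vectors yields each chunk's start state.

    The vectors are independent per chunk; the loop below is a sequential
    stand-in for a parallel prefix scan.
    """
    vectors = [chunk_transition_vector(c) for c in chunks]
    starts = []
    acc = tuple(range(NUM_STATES))  # identity transition vector
    for vec in vectors:
        starts.append(acc[initial])
        acc = compose(acc, vec)
    return starts


def delimiter_offsets(data, chunk_size):
    """Return offsets of commas that act as delimiters (outside double quotes)."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    starts = chunk_start_states(chunks)
    offsets = []
    for idx, (chunk, state) in enumerate(zip(chunks, starts)):
        base = idx * chunk_size
        for j, ch in enumerate(chunk):
            if ch == ',' and state == UNQUOTED:
                offsets.append(base + j)
            state = step(state, ch)
    return offsets
```

For the input `a,"b,c",d` split into chunks of three characters, the second chunk `b,c` begins inside a quoted field, so its comma is not reported as a delimiter; the scan over transition vectors is what tells that chunk its start state.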

Original language: English
Pages (from - to): 616-628
Number of pages: 13
Journal: Proceedings of the VLDB Endowment
Volume: 13
Issue number: 5
DOIs
Publication status: Published - 2020
Event: 46th International Conference on Very Large Data Bases, VLDB 2020 - Virtual, Japan
Duration: 31 Aug 2020 to 4 Sep 2020
