ParPaRaw: Massively parallel parsing of delimiter-separated raw data

Elias Stehle, Hans Arno Jacobsen

Research output: Contribution to journal › Conference article › peer-review

9 Scopus citations


Parsing is essential for a wide range of use cases, such as stream processing, bulk loading, and in-situ querying of raw data. Yet this compute-intensive step often constitutes a major bottleneck in the data ingestion pipeline, since inputs that require more involved parsing rules are challenging to parse in parallel. This work proposes a massively parallel algorithm for parsing delimiter-separated data formats on GPUs. Unlike the state of the art, the proposed approach does not require an initial sequential pass over the input to determine a thread's parsing context, that is, how a thread beginning somewhere in the middle of the input should interpret a certain symbol (e.g., whether to interpret a comma as a delimiter or as part of a larger string enclosed in double quotes). Instead of tailoring the approach to a single format, we are able to perform a massively parallel finite state machine (FSM) simulation, which is more flexible and powerful, supporting more expressive parsing rules with general applicability. Achieving a parsing rate of as much as 14.2 GB/s, our experimental evaluation on a GPU with 3,584 cores shows that the presented approach is able to scale to thousands of cores and beyond. With an end-to-end streaming approach, we are able to exploit the full-duplex capabilities of the PCIe bus and hide latency from data transfers. Considering the end-to-end performance, the algorithm parses 4.8 GB in as little as 0.44 seconds, including data transfers.
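
The idea of determining a thread's parsing context without a sequential pass can be illustrated with a small CPU-side sketch. It is not the authors' GPU implementation: the two-state quote automaton, the chunking scheme, and all function names below are assumptions made for illustration. Each chunk is simulated from every possible start state, giving a state-transition vector; because composing these vectors is associative, a prefix scan (sequential here, parallelisable in principle) yields the true start state of every chunk.

```python
from itertools import accumulate

# Hypothetical two-state automaton (an assumption, not the paper's FSM):
# OUTSIDE = not inside a quoted field, INSIDE = inside double quotes.
OUTSIDE, INSIDE = 0, 1
NUM_STATES = 2


def step(state, ch):
    """Advance the automaton by one symbol; a double quote toggles the state."""
    if ch == '"':
        return INSIDE if state == OUTSIDE else OUTSIDE
    return state


def transition_vector(chunk):
    """For each possible start state, return the state reached after the chunk."""
    vec = list(range(NUM_STATES))
    for ch in chunk:
        vec = [step(s, ch) for s in vec]
    return tuple(vec)


def compose(first, second):
    """Combine two transition vectors: apply `first`, then `second`."""
    return tuple(second[s] for s in first)


def chunk_start_states(data, chunk_size, initial=OUTSIDE):
    """Split the input into chunks and derive each chunk's true start state."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Each vector depends only on its own chunk, so this step is independent
    # per chunk.
    vectors = [transition_vector(c) for c in chunks]
    # Inclusive scan over an associative operator; run sequentially here.
    prefixes = list(accumulate(vectors, compose))
    starts = [initial] + [p[initial] for p in prefixes[:-1]]
    return chunks, starts


def find_delimiters(data, chunk_size, delimiter=","):
    """Return positions of delimiters that lie outside quoted fields."""
    chunks, starts = chunk_start_states(data, chunk_size)
    positions = []
    offset = 0
    for chunk, state in zip(chunks, starts):
        for i, ch in enumerate(chunk):
            if ch == delimiter and state == OUTSIDE:
                positions.append(offset + i)
            state = step(state, ch)
        offset += len(chunk)
    return positions
```

For the input `a,"b,c",d`, calling `find_delimiters` with any chunk size reports the commas at positions 1 and 7 and skips the quoted comma at position 4, even when a chunk boundary falls inside the quoted field.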

Original language: English
Pages (from-to): 616-628
Number of pages: 13
Journal: Proceedings of the VLDB Endowment
Issue number: 5
State: Published - 2020
Event: 46th International Conference on Very Large Data Bases, VLDB 2020 - Virtual, Japan
Duration: 31 Aug 2020 to 4 Sep 2020

