Abstract
Parsing is essential for a wide range of use cases, such as stream processing, bulk loading, and in-situ querying of raw data. Yet this compute-intensive step often constitutes a major bottleneck in the data ingestion pipeline, because inputs that require more involved parsing rules are challenging to parse in parallel. This work proposes a massively parallel algorithm for parsing delimiter-separated data formats on GPUs. Unlike the state of the art, the proposed approach does not require an initial sequential pass over the input to determine a thread's parsing context, that is, how a thread beginning somewhere in the middle of the input should interpret a certain symbol (e.g., whether to interpret a comma as a delimiter or as part of a larger string enclosed in double quotes). Instead of tailoring the approach to a single format, we perform a massively parallel finite state machine (FSM) simulation, which is more flexible and powerful, supporting more expressive parsing rules with general applicability. Achieving a parsing rate of as much as 14.2 GB/s, our experimental evaluation on a GPU with 3,584 cores shows that the presented approach is able to scale to thousands of cores and beyond. With an end-to-end streaming approach, we are able to exploit the full-duplex capabilities of the PCIe bus and hide latency from data transfers. Considering the end-to-end performance, the algorithm parses 4.8 GB in as little as 0.44 seconds, including data transfers.
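The abstract does not spell out how the FSM simulation removes the sequential pass, so the following is only a minimal sketch of the general technique, not the paper's implementation. It assumes a simplified two-state FSM (outside or inside double quotes) and a fixed chunk size; all function names are hypothetical. Each chunk is simulated from every possible start state, and the resulting state-transition vectors are combined with an associative composition, which on a GPU could be evaluated as a parallel prefix scan.

```python
from functools import reduce

# Assumed two-state FSM: outside or inside a double-quoted string.
UNQUOTED, QUOTED = 0, 1
STATES = (UNQUOTED, QUOTED)


def step(state, ch):
    """Transition function: a double quote toggles the state."""
    if ch == '"':
        return QUOTED if state == UNQUOTED else UNQUOTED
    return state


def transition_vector(chunk):
    """For every possible start state, the state reached after the chunk."""
    return tuple(reduce(step, chunk, s) for s in STATES)


def compose(first, second):
    """Vector for reading the chunk of `first`, then the chunk of `second`.

    Composition is associative, which is what permits a parallel scan.
    """
    return tuple(second[first[s]] for s in STATES)


def chunk_start_states(vectors, initial=UNQUOTED):
    """Exclusive scan over the vectors (written sequentially for clarity)."""
    starts = []
    acc = tuple(STATES)  # identity vector: every state maps to itself
    for vec in vectors:
        starts.append(acc[initial])
        acc = compose(acc, vec)
    return starts


def find_delimiters(data, chunk_size):
    """Positions of commas and newlines that lie outside quotes."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Pass 1: every chunk is processed independently of the others.
    vectors = [transition_vector(c) for c in chunks]
    starts = chunk_start_states(vectors)
    positions = []
    # Pass 2: every chunk now knows its true start state.
    for idx, (chunk, state) in enumerate(zip(chunks, starts)):
        for off, ch in enumerate(chunk):
            if state == UNQUOTED and ch in ",\n":
                positions.append(idx * chunk_size + off)
            state = step(state, ch)
    return positions
```

Calling `find_delimiters(text, chunk_size)` with different chunk sizes should return the same positions, since a chunk's start state is derived from the composed vectors rather than from where the chunk boundaries happen to fall.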
| Original language | English |
|---|---|
| Pages (from-to) | 616-628 |
| Number of pages | 13 |
| Journal | Proceedings of the VLDB Endowment |
| Volume | 13 |
| Issue number | 5 |
| DOIs | |
| State | Published - 2020 |
| Event | 46th International Conference on Very Large Data Bases, VLDB 2020, Virtual, Japan. Duration: 31 Aug 2020 → 4 Sep 2020 |