Abstract
Parsing is essential for a wide range of use cases, such as stream processing, bulk loading, and in-situ querying of raw data. Yet, this compute-intensive step often constitutes a major bottleneck in the data ingestion pipeline, since inputs that require more involved parsing rules are challenging to parallelise. This work proposes a massively parallel algorithm for parsing delimiter-separated data formats on GPUs. Unlike the state of the art, the proposed approach does not require an initial sequential pass over the input to determine a thread's parsing context, i.e., how a thread that begins somewhere in the middle of the input should interpret a given symbol (e.g., whether to treat a comma as a delimiter or as part of a larger string enclosed in double quotes). Rather than tailoring the approach to a single format, we are able to perform a massively parallel finite state machine (FSM) simulation, which is more flexible and powerful, supporting more expressive parsing rules with general applicability. Achieving a parsing rate of as much as 14.2 GB/s, our experimental evaluation on a GPU with 3,584 cores shows that the presented approach is able to scale to thousands of cores and beyond. With an end-to-end streaming approach, we are able to exploit the full-duplex capabilities of the PCIe bus and hide latency from data transfers. Considering the end-to-end performance, the algorithm parses 4.8 GB in as little as 0.44 seconds, including data transfers.
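The paper's kernels and data structures are not reproduced on this page. As a minimal illustration of the general idea the abstract alludes to (determining a thread's parsing context without a sequential pre-pass by simulating the FSM from every possible start state and composing the results), the following hypothetical CUDA sketch uses a toy two-state DFA for double-quoted CSV fields; the state machine, chunk assignment, and composition step are illustrative assumptions, not the authors' implementation.

```cuda
// Hypothetical sketch (not the paper's implementation): speculative FSM
// simulation for a toy CSV quoting grammar. Each thread simulates the DFA
// over its chunk from *every* possible start state, producing a small
// state-transition table. Composing those tables afterwards resolves the
// true start state of each chunk without a sequential pass over the input.
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Two DFA states suffice for this toy grammar:
// 0 = outside a quoted field, 1 = inside a quoted field.
#define NUM_STATES 2

__host__ __device__ int step(int state, char c) {
    // A double quote toggles the quoting context; every other symbol keeps it.
    return (c == '"') ? (1 - state) : state;
}

// One thread per chunk: record the end state reached for each assumed
// start state, i.e. a function {0,1} -> {0,1} stored as two ints.
__global__ void simulate_chunks(const char *buf, int n, int chunk_size,
                                int *out_map /* [num_chunks][NUM_STATES] */) {
    int chunk = blockIdx.x * blockDim.x + threadIdx.x;
    int begin = chunk * chunk_size;
    if (begin >= n) return;
    int end = min(begin + chunk_size, n);
    for (int s = 0; s < NUM_STATES; ++s) {
        int cur = s;
        for (int i = begin; i < end; ++i) cur = step(cur, buf[i]);
        out_map[chunk * NUM_STATES + s] = cur;
    }
}

int main() {
    const char *input = "a,\"b,b\",c\n\"d\nd\",e,f\n";
    int n = (int)strlen(input), chunk_size = 4;
    int num_chunks = (n + chunk_size - 1) / chunk_size;

    char *d_buf; int *d_map;
    cudaMalloc(&d_buf, n);
    cudaMalloc(&d_map, num_chunks * NUM_STATES * sizeof(int));
    cudaMemcpy(d_buf, input, n, cudaMemcpyHostToDevice);

    simulate_chunks<<<(num_chunks + 255) / 256, 256>>>(d_buf, n, chunk_size, d_map);

    int *map = new int[num_chunks * NUM_STATES];
    cudaMemcpy(map, d_map, num_chunks * NUM_STATES * sizeof(int),
               cudaMemcpyDeviceToHost);

    // Compose the per-chunk transition tables to recover each chunk's actual
    // start state (a parallel scan in a real implementation; done sequentially
    // here for clarity). Once known, every thread can parse its chunk locally.
    int state = 0;  // the whole input starts outside quotes
    for (int c = 0; c < num_chunks; ++c) {
        printf("chunk %d starts in state %d\n", c, state);
        state = map[c * NUM_STATES + state];
    }
    delete[] map;
    cudaFree(d_buf); cudaFree(d_map);
    return 0;
}
```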
| Original language | English |
|---|---|
| Pages (from-to) | 616-628 |
| Number of pages | 13 |
| Journal | Proceedings of the VLDB Endowment |
| Volume | 13 |
| Issue number | 5 |
| DOIs | |
| State | Published - 2020 |
| Event | 46th International Conference on Very Large Data Bases, VLDB 2020 - Virtual, Japan. Duration: 31 Aug 2020 → 4 Sep 2020 |