Abstract
Parsing is essential for a wide range of use cases, such as stream processing, bulk loading, and in-situ querying of raw data. Yet, this compute-intensive step often constitutes a major bottleneck in the data ingestion pipeline, since inputs that require more involved parsing rules are challenging to parallelise. This work proposes a massively parallel algorithm for parsing delimiter-separated data formats on GPUs. Unlike the state of the art, the proposed approach does not require an initial sequential pass over the input to determine a thread's parsing context, i.e., how a thread beginning somewhere in the middle of the input should interpret a certain symbol (e.g., whether to interpret a comma as a delimiter or as part of a larger string enclosed in double quotes). Instead of tailoring the approach to a single format, we perform a massively parallel finite state machine (FSM) simulation, which is more flexible and powerful, supporting more expressive parsing rules with general applicability. Achieving a parsing rate of as much as 14.2 GB/s, our experimental evaluation on a GPU with 3,584 cores shows that the presented approach is able to scale to thousands of cores and beyond. With an end-to-end streaming approach, we are able to exploit the full-duplex capabilities of the PCIe bus and hide latency from data transfers. Considering end-to-end performance, the algorithm parses 4.8 GB in as little as 0.44 seconds, including data transfers.
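The core idea the abstract describes, resolving a thread's parsing context without a sequential pre-pass, can be illustrated with a speculative FSM simulation: each input chunk is evaluated from every possible start state, yielding a transition vector, and these vectors are composed (an associative operation, so it maps onto a parallel prefix scan on the GPU). The following is a minimal CPU-side sketch of this general technique under assumed details; the toy two-state CSV machine, the chunk size, and the sequential composition loop are illustrative assumptions, not the paper's actual implementation.

```cpp
#include <algorithm>
#include <array>
#include <iostream>
#include <string>
#include <vector>

// Toy CSV FSM: state 0 = outside quotes, state 1 = inside a quoted field.
constexpr int kNumStates = 2;

int step(int state, char c) {
    if (c == '"') return 1 - state;  // a double quote toggles the quoted context
    return state;                    // delimiters etc. do not change the context
}

// A chunk's effect on the FSM, expressed as a mapping start state -> end state.
using TransitionVector = std::array<int, kNumStates>;

// Speculatively run the FSM over one chunk from *every* possible start state.
// On the GPU, one thread (or thread group) would handle one chunk like this.
TransitionVector simulateChunk(const std::string& input, size_t begin, size_t end) {
    TransitionVector t{};
    for (int s = 0; s < kNumStates; ++s) {
        int state = s;
        for (size_t i = begin; i < end; ++i) state = step(state, input[i]);
        t[s] = state;
    }
    return t;
}

// Composing transition vectors is associative, which is what allows the
// per-chunk results to be combined with a parallel prefix scan.
TransitionVector compose(const TransitionVector& first, const TransitionVector& second) {
    TransitionVector t{};
    for (int s = 0; s < kNumStates; ++s) t[s] = second[first[s]];
    return t;
}

int main() {
    const std::string input = "a,\"b,b\",c\n1,\"2\n2\",3\n";
    const size_t chunkSize = 5;  // illustrative; real chunk sizes depend on the GPU

    // Phase 1: independent, embarrassingly parallel chunk simulation.
    std::vector<TransitionVector> chunks;
    for (size_t begin = 0; begin < input.size(); begin += chunkSize)
        chunks.push_back(simulateChunk(input, begin,
                                       std::min(begin + chunkSize, input.size())));

    // Phase 2: prefix composition (shown here as a simple sequential loop)
    // yields each chunk's true start state; state 0 = unquoted at input start.
    int state = 0;
    for (size_t i = 0; i < chunks.size(); ++i) {
        std::cout << "chunk " << i << " starts in state " << state << "\n";
        state = chunks[i][state];
    }
}
```

With the start state of every chunk known, each chunk can then be parsed independently in a second fully parallel pass, which is why no initial sequential scan over the raw input is needed.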
Original language | English |
---|---|
Pages (from-to) | 616-628 |
Number of pages | 13 |
Journal | Proceedings of the VLDB Endowment |
Volume | 13 |
Issue number | 5 |
DOIs | |
Publication status | Published - 2020 |
Event | 46th International Conference on Very Large Data Bases, VLDB 2020 - Virtual, Japan; Duration: 31 Aug. 2020 → 4 Sept. 2020 |