Seamless Integration of Parquet Files into Data Processing

Alice Rey, Michael Freitag, Thomas Neumann

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

2 Scopus citations

Abstract

Relational database systems are still the most powerful tool for data analysis. However, the steps necessary to bring existing data into the database make them unattractive for data exploration, especially when the data is stored in data lakes where users often use Parquet files, a binary column-oriented file format. This paper presents a fast Parquet framework that tackles these problems without costly ETL steps. We incrementally collect information during query execution. We create statistics that enhance future queries. In addition, we split the file into chunks for which we store the data ranges. We call these synopses. They allow us to skip entire sections in future queries. We show that these techniques only add a minor overhead to the first query and are of benefit for future requests. Our evaluation demonstrates that our implementation can achieve comparable results to database relations and that we can outperform existing systems by up to an order of magnitude.
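The chunk-skipping idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Synopsis` structure and the `build_synopses` and `scan_range` helpers are hypothetical names introduced here, and the sketch builds the ranges in a separate pass over an in-memory list, whereas the paper collects them incrementally during query execution on Parquet data.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Synopsis:
    """Value range of one chunk (hypothetical structure, not from the paper)."""
    start: int  # index of the first row in the chunk
    end: int    # index one past the last row in the chunk
    lo: int     # smallest value in the chunk
    hi: int     # largest value in the chunk


def build_synopses(column: Sequence[int], chunk_size: int) -> List[Synopsis]:
    """Record the min/max of every fixed-size chunk in one scan of the column."""
    synopses = []
    for start in range(0, len(column), chunk_size):
        chunk = column[start:start + chunk_size]
        synopses.append(Synopsis(start, start + len(chunk), min(chunk), max(chunk)))
    return synopses


def scan_range(column: Sequence[int], synopses: List[Synopsis],
               lo: int, hi: int) -> Tuple[List[int], int]:
    """Return the values in [lo, hi] and the number of chunks skipped."""
    matches: List[int] = []
    skipped = 0
    for s in synopses:
        # The chunk's range does not overlap the predicate: skip it unread.
        if s.hi < lo or s.lo > hi:
            skipped += 1
            continue
        matches.extend(v for v in column[s.start:s.end] if lo <= v <= hi)
    return matches, skipped


# Assumed toy data: three chunks of four values each.
column = [1, 3, 2, 4, 10, 12, 11, 13, 20, 22, 21, 23]
synopses = build_synopses(column, chunk_size=4)
matches, skipped = scan_range(column, synopses, lo=10, hi=12)
print(matches, skipped)
```

In this toy example only the middle chunk overlaps the predicate range, so the other two chunks are never read. The benefit depends on how well the data is clustered: if every chunk spans the full value range, no chunk can be skipped.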

Original language: English
Title of host publication: Datenbanksysteme für Business, Technologie und Web, BTW 2023
Editors: Birgitta König-Ries, Stefanie Scherzinger, Wolfgang Lehner, Gottfried Vossen
Publisher: Gesellschaft für Informatik (GI)
Pages: 235-258
Number of pages: 24
ISBN (Electronic): 9783885797258
DOIs
State: Published - 2023
Event: 2023 Datenbanksysteme für Business, Technologie und Web, BTW 2023 - 2023 Database Systems for Business, Technology and Web, BTW 2023 - Dresden, Germany
Duration: 6 Mar 2023 - 10 Mar 2023

Publication series

Name: Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft für Informatik (GI)
Volume: P-331
ISSN (Print): 1617-5468
ISSN (Electronic): 2944-7682

Conference

Conference: 2023 Datenbanksysteme für Business, Technologie und Web, BTW 2023 - 2023 Database Systems for Business, Technology and Web, BTW 2023
Country/Territory: Germany
City: Dresden
Period: 6/03/23 - 10/03/23
