TY - GEN
T1 - Seamless Integration of Parquet Files into Data Processing
AU - Rey, Alice
AU - Freitag, Michael
AU - Neumann, Thomas
N1 - Publisher Copyright:
© 2023 Gesellschaft für Informatik (GI). All rights reserved.
PY - 2023
Y1 - 2023
AB - Relational database systems are still the most powerful tool for data analysis. However, the steps necessary to bring existing data into the database make them unattractive for data exploration, especially when the data is stored in data lakes where users often use Parquet files, a binary column-oriented file format. This paper presents a fast Parquet framework that tackles these problems without costly ETL steps. We incrementally collect information during query execution. We create statistics that enhance future queries. In addition, we split the file into chunks for which we store the data ranges. We call these synopses. They allow us to skip entire sections in future queries. We show that these techniques only add a minor overhead to the first query and are of benefit for future requests. Our evaluation demonstrates that our implementation can achieve comparable results to database relations and that we can outperform existing systems by up to an order of magnitude.
UR - http://www.scopus.com/inward/record.url?scp=85149995560&partnerID=8YFLogxK
U2 - 10.18420/BTW2023-12
DO - 10.18420/BTW2023-12
M3 - Conference contribution
AN - SCOPUS:85149995560
T3 - Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft für Informatik (GI)
SP - 235
EP - 258
BT - Datenbanksysteme für Business, Technologie und Web, BTW 2023
A2 - König-Ries, Birgitta
A2 - Scherzinger, Stefanie
A2 - Lehner, Wolfgang
A2 - Vossen, Gottfried
PB - Gesellschaft für Informatik (GI)
T2 - 2023 Datenbanksysteme für Business, Technologie und Web, BTW 2023 - 2023 Database Systems for Business, Technology and Web, BTW 2023
Y2 - 6 March 2023 through 10 March 2023
ER -