TY - GEN
T1 - SPA
T2 - 39th IEEE International Conference on Data Engineering, ICDE 2023
AU - Boncz, Peter
AU - Chronis, Yannis
AU - Finis, Jan
AU - Halfpap, Stefan
AU - Leis, Viktor
AU - Neumann, Thomas
AU - Nica, Anisoara
AU - Sauer, Caetano
AU - Stolze, Knut
AU - Zukowski, Marcin
N1 - Publisher Copyright: © 2023 IEEE.
PY - 2023
Y1 - 2023
AB - Selective queries are not uncommon in large-scale data analytics, for example, when drilling down into a specific customer in a dashboard. Traditionally, selective queries are accelerated by creating secondary indexes. However, because of their large size, expensive maintenance, and difficulty to tune and automate, indexes are typically not used in modern cloud data warehouses or data lakes. Instead, such systems rely mostly on full table scans and lightweight optimizations like min/max filtering, whose effectiveness depends heavily on the data layout and value distributions. We propose SPA as the vision for automatically optimizing selective queries for immutable copy-on-write data formats. SPA adaptively indexes subsets of the data in an incremental and workload-driven manner. It makes fine-grained decisions and continuously monitors their benefit, dynamically allocating an optimization budget in a way that bounds the additional cost of indexing. Furthermore, it guarantees a performance improvement in the cases where indexes - potentially partial ones - prove to be beneficial. When indexes lose their benefit due to a shifting workload, they are gradually deconstructed in favor of optimizations that accommodate recent trends. As SPA does not require information about updates performed on the data, it can also be employed as an accelerator for systems that do not control the data, e.g., in cloud data lakes.
UR - http://www.scopus.com/inward/record.url?scp=85167664594&partnerID=8YFLogxK
U2 - 10.1109/ICDE55515.2023.00302
DO - 10.1109/ICDE55515.2023.00302
M3 - Conference contribution
AN - SCOPUS:85167664594
T3 - Proceedings - International Conference on Data Engineering
SP - 3740
EP - 3746
BT - Proceedings - 2023 IEEE 39th International Conference on Data Engineering, ICDE 2023
PB - IEEE Computer Society
Y2 - 3 April 2023 through 7 April 2023
ER -