TY - GEN
T1 - Flow-Join
T2 - 32nd IEEE International Conference on Data Engineering, ICDE 2016
AU - Rodiger, Wolf
AU - Idicula, Sam
AU - Kemper, Alfons
AU - Neumann, Thomas
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2016/6/22
Y1 - 2016/6/22
N2 - Modern InfiniBand interconnects offer link speeds of several gigabytes per second and a remote direct memory access (RDMA) paradigm for zero-copy network communication. Both are crucial for parallel database systems to achieve scalable distributed query processing where adding a server to the cluster increases performance. However, the scalability of distributed joins is threatened by unexpected data characteristics: Skew can cause a severe load imbalance such that a single server has to process a much larger part of the input than its fair share and by this slows down the entire distributed query. We introduce Flow-Join, a novel distributed join algorithm that handles attribute value skew with minimal overhead. Flow-Join detects heavy hitters at runtime using small approximate histograms and adapts the redistribution scheme to resolve load imbalances before they impact the join performance. Previous approaches often involve expensive analysis phases, which slow down distributed join processing for non-skewed workloads. This is especially the case for modern high-speed interconnects, which are too fast to hide the extra computation. Other skew handling approaches require detailed statistics, which are often not available or overly inaccurate for intermediate results. In contrast, Flow-Join uses our novel lightweight skew handling scheme to execute at the full network speed of more than 6 GB/s for InfiniBand 4×FDR, joining a skewed input at 11.5 billion tuples/s with 32 servers. This is 6.8× faster than a standard distributed hash join using the same hardware. At the same time, Flow-Join does not compromise the join performance for non-skewed workloads.
AB - Modern InfiniBand interconnects offer link speeds of several gigabytes per second and a remote direct memory access (RDMA) paradigm for zero-copy network communication. Both are crucial for parallel database systems to achieve scalable distributed query processing where adding a server to the cluster increases performance. However, the scalability of distributed joins is threatened by unexpected data characteristics: Skew can cause a severe load imbalance such that a single server has to process a much larger part of the input than its fair share and by this slows down the entire distributed query. We introduce Flow-Join, a novel distributed join algorithm that handles attribute value skew with minimal overhead. Flow-Join detects heavy hitters at runtime using small approximate histograms and adapts the redistribution scheme to resolve load imbalances before they impact the join performance. Previous approaches often involve expensive analysis phases, which slow down distributed join processing for non-skewed workloads. This is especially the case for modern high-speed interconnects, which are too fast to hide the extra computation. Other skew handling approaches require detailed statistics, which are often not available or overly inaccurate for intermediate results. In contrast, Flow-Join uses our novel lightweight skew handling scheme to execute at the full network speed of more than 6 GB/s for InfiniBand 4×FDR, joining a skewed input at 11.5 billion tuples/s with 32 servers. This is 6.8× faster than a standard distributed hash join using the same hardware. At the same time, Flow-Join does not compromise the join performance for non-skewed workloads.
UR - http://www.scopus.com/inward/record.url?scp=84980411154&partnerID=8YFLogxK
U2 - 10.1109/ICDE.2016.7498324
DO - 10.1109/ICDE.2016.7498324
M3 - Conference contribution
AN - SCOPUS:84980411154
T3 - 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016
SP - 1194
EP - 1205
BT - 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 16 May 2016 through 20 May 2016
ER -