TY - GEN
T1 - A memory bandwidth-efficient hybrid radix Sort on GPUs
AU - Stehle, Elias
AU - Jacobsen, Hans Arno
N1 - Publisher Copyright:
© 2017 ACM.
PY - 2017/5/9
Y1 - 2017/5/9
N2 - Sorting is at the core of many database operations, such as index creation, sort-merge joins, and user-requested output sorting. As GPUs are emerging as a promising platform to accelerate various operations, sorting on GPUs becomes a viable endeavour. Over the past few years, several improvements have been proposed for sorting on GPUs, leading to the first radix sort implementations that achieve a sorting rate of over one billion 32-bit keys per second. Yet, state-of-the-art approaches are heavily memory bandwidthbound, as they require substantially more memory transfers than their CPU-based counterparts. Our work proposes a novel approach that almost halves the amount of memory transfers and, therefore, considerably lifts the memory bandwidth limitation. Being able to sort two gigabytes of eightbyte records in as little as 50 milliseconds, our approach achieves a 2:32-fold improvement over the state-of-the-art GPU-based radix sort for uniform distributions, sustaining a minimum speed-up of no less than a factor of 1:66 for skewed distributions. To address inputs that either do not reside on the GPU or exceed the available device memory, we build on our efficient GPU sorting approach with a pipelined heterogeneous sorting algorithm that mitigates the overhead associated with PCIe data transfers. Comparing the end-toend sorting performance to the state-of-the-art CPU-based radix sort running 16 threads, our heterogeneous approach achieves a 2:06-fold and a 1:53-fold improvement for sorting 64 GB key-value pairs with a skewed and a uniform distribution, respectively.
AB - Sorting is at the core of many database operations, such as index creation, sort-merge joins, and user-requested output sorting. As GPUs are emerging as a promising platform to accelerate various operations, sorting on GPUs becomes a viable endeavour. Over the past few years, several improvements have been proposed for sorting on GPUs, leading to the first radix sort implementations that achieve a sorting rate of over one billion 32-bit keys per second. Yet, state-of-the-art approaches are heavily memory bandwidthbound, as they require substantially more memory transfers than their CPU-based counterparts. Our work proposes a novel approach that almost halves the amount of memory transfers and, therefore, considerably lifts the memory bandwidth limitation. Being able to sort two gigabytes of eightbyte records in as little as 50 milliseconds, our approach achieves a 2:32-fold improvement over the state-of-the-art GPU-based radix sort for uniform distributions, sustaining a minimum speed-up of no less than a factor of 1:66 for skewed distributions. To address inputs that either do not reside on the GPU or exceed the available device memory, we build on our efficient GPU sorting approach with a pipelined heterogeneous sorting algorithm that mitigates the overhead associated with PCIe data transfers. Comparing the end-toend sorting performance to the state-of-the-art CPU-based radix sort running 16 threads, our heterogeneous approach achieves a 2:06-fold and a 1:53-fold improvement for sorting 64 GB key-value pairs with a skewed and a uniform distribution, respectively.
UR - http://www.scopus.com/inward/record.url?scp=85021221920&partnerID=8YFLogxK
U2 - 10.1145/3035918.3064043
DO - 10.1145/3035918.3064043
M3 - Conference contribution
AN - SCOPUS:85021221920
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 417
EP - 432
BT - SIGMOD 2017 - Proceedings of the 2017 ACM International Conference on Management of Data
PB - Association for Computing Machinery
T2 - 2017 ACM SIGMOD International Conference on Management of Data, SIGMOD 2017
Y2 - 14 May 2017 through 19 May 2017
ER -