TY - JOUR
T1 - A holistic scalable implementation approach of the lattice Boltzmann method for CPU/GPU heterogeneous clusters
AU - Riesinger, Christoph
AU - Bakhtiari, Arash
AU - Schreiber, Martin
AU - Neumann, Philipp
AU - Bungartz, Hans Joachim
N1 - Publisher Copyright: © 2017 by the authors.
PY - 2017/12/1
Y1 - 2017/12/1
N2 - Heterogeneous clusters are a widely utilized class of supercomputers assembled from different types of computing devices, for instance CPUs and GPUs, providing a huge computational potential. Programming them in a scalable way exploiting the maximal performance introduces numerous challenges such as optimizations for different computing devices, dealing with multiple levels of parallelism, the application of different programming models, work distribution, and hiding of communication with computation. We utilize the lattice Boltzmann method for fluid flow as a representative of a scientific computing application and develop a holistic implementation for large-scale CPU/GPU heterogeneous clusters. We review and combine a set of best practices and techniques ranging from optimizations for the particular computing devices to the orchestration of tens of thousands of CPU cores and thousands of GPUs. Eventually, we come up with an implementation using all the available computational resources for the lattice Boltzmann method operators. Our approach shows excellent scalability behavior making it future-proof for heterogeneous clusters of the upcoming architectures on the exaFLOPS scale. Parallel efficiencies of more than 90% are achieved leading to 2604.72 GLUPS utilizing 24,576 CPU cores and 2048 GPUs of the CPU/GPU heterogeneous cluster Piz Daint and computing more than 6.8 × 10^9 lattice cells.
AB - Heterogeneous clusters are a widely utilized class of supercomputers assembled from different types of computing devices, for instance CPUs and GPUs, providing a huge computational potential. Programming them in a scalable way exploiting the maximal performance introduces numerous challenges such as optimizations for different computing devices, dealing with multiple levels of parallelism, the application of different programming models, work distribution, and hiding of communication with computation. We utilize the lattice Boltzmann method for fluid flow as a representative of a scientific computing application and develop a holistic implementation for large-scale CPU/GPU heterogeneous clusters. We review and combine a set of best practices and techniques ranging from optimizations for the particular computing devices to the orchestration of tens of thousands of CPU cores and thousands of GPUs. Eventually, we come up with an implementation using all the available computational resources for the lattice Boltzmann method operators. Our approach shows excellent scalability behavior making it future-proof for heterogeneous clusters of the upcoming architectures on the exaFLOPS scale. Parallel efficiencies of more than 90% are achieved leading to 2604.72 GLUPS utilizing 24,576 CPU cores and 2048 GPUs of the CPU/GPU heterogeneous cluster Piz Daint and computing more than 6.8 × 10^9 lattice cells.
KW - GPU clusters
KW - Heterogeneous clusters
KW - Hybrid implementation
KW - Lattice Boltzmann method
KW - Multilevel parallelism
KW - Petascale
KW - Resource assignment
KW - Scalability
UR - http://www.scopus.com/inward/record.url?scp=85045400557&partnerID=8YFLogxK
U2 - 10.3390/computation5040048
DO - 10.3390/computation5040048
M3 - Article
AN - SCOPUS:85045400557
SN - 2079-3197
VL - 5
JO - Computation
JF - Computation
IS - 4
M1 - 48
ER -