TY - GEN
T1 - SeisSol on Distributed Multi-GPU Systems
T2 - 2021 International Conference on High Performance Computing in Asia-Pacific Region, HPC Asia 2021
AU - Dorozhinskii, Ravil
AU - Bader, Michael
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/1/20
Y1 - 2021/1/20
N2 - We present a GPU implementation of the high-order Discontinuous Galerkin (DG) scheme in SeisSol, a software package for simulating seismic waves and earthquake dynamics. Our particular focus is on providing a performance-portable solution for heterogeneous distributed multi-GPU systems. We therefore redesigned SeisSol's code generation cascade for GPU programming models. This includes CUDA source code generation for the performance-critical small batched matrix multiplication kernels. The parallelisation extends the existing MPI+X scheme and supports SeisSol's cluster-wise Local Time Stepping (LTS) algorithm for ADER time integration. We performed a Roofline model analysis to ensure that the generated batched matrix operations achieve the performance limits posed by the memory-bandwidth roofline. Our results also demonstrate that the generated GPU kernels outperform the corresponding cuBLAS subroutines by 2.5 times on average. We present strong and weak scaling studies of our implementation on the Marconi100 supercomputer (with 4 NVIDIA Volta V100 GPUs per node) on up to 256 GPUs, which revealed good parallel performance and efficiency for time integration using global time stepping (GTS). However, we show that directly mapping the LTS method from CPUs to distributed GPU environments results in lower hardware utilization. Nevertheless, due to the algorithmic advantages of local time stepping, the method still reduces time-to-solution by a factor of 1.3 on average compared to the GTS scheme.
AB - We present a GPU implementation of the high-order Discontinuous Galerkin (DG) scheme in SeisSol, a software package for simulating seismic waves and earthquake dynamics. Our particular focus is on providing a performance-portable solution for heterogeneous distributed multi-GPU systems. We therefore redesigned SeisSol's code generation cascade for GPU programming models. This includes CUDA source code generation for the performance-critical small batched matrix multiplication kernels. The parallelisation extends the existing MPI+X scheme and supports SeisSol's cluster-wise Local Time Stepping (LTS) algorithm for ADER time integration. We performed a Roofline model analysis to ensure that the generated batched matrix operations achieve the performance limits posed by the memory-bandwidth roofline. Our results also demonstrate that the generated GPU kernels outperform the corresponding cuBLAS subroutines by 2.5 times on average. We present strong and weak scaling studies of our implementation on the Marconi100 supercomputer (with 4 NVIDIA Volta V100 GPUs per node) on up to 256 GPUs, which revealed good parallel performance and efficiency for time integration using global time stepping (GTS). However, we show that directly mapping the LTS method from CPUs to distributed GPU environments results in lower hardware utilization. Nevertheless, due to the algorithmic advantages of local time stepping, the method still reduces time-to-solution by a factor of 1.3 on average compared to the GTS scheme.
KW - ADER
KW - Discontinuous Galerkin
KW - GPU
KW - SeisSol
KW - code generation
KW - high performance computing
KW - local time stepping
KW - seismic wave propagation
UR - http://www.scopus.com/inward/record.url?scp=85099876394&partnerID=8YFLogxK
U2 - 10.1145/3432261.3436753
DO - 10.1145/3432261.3436753
M3 - Conference contribution
AN - SCOPUS:85099876394
T3 - ACM International Conference Proceeding Series
SP - 69
EP - 82
BT - Proceedings of International Conference on High Performance Computing in Asia-Pacific Region, HPC Asia 2021
PB - Association for Computing Machinery
Y2 - 20 January 2021 through 22 January 2021
ER -