TY - GEN
T1 - Efficient GPU Offloading with OpenMP for a Hyperbolic Finite Volume Solver on Dynamically Adaptive Meshes
AU - Wille, Mario
AU - Weinzierl, Tobias
AU - Brito Gadeschi, Gonzalo
AU - Bader, Michael
N1 - Publisher Copyright: © 2023, The Author(s).
PY - 2023
Y1 - 2023
N2 - We identify and show how to overcome an OpenMP bottleneck in the administration of GPU memory. It arises for a wave equation solver on dynamically adaptive block-structured Cartesian meshes, which keeps all CPU threads busy and allows all of them to offload sets of patches to the GPU. Our studies show that multithreaded, concurrent, non-deterministic access to the GPU leads to performance breakdowns, since the GPU memory bookkeeping as offered through OpenMP’s map clause, i.e., the allocation and freeing, becomes another runtime challenge besides expensive data transfer and actual computation. We, therefore, propose to retain the memory management responsibility on the host: A caching mechanism acquires memory on the accelerator for all CPU threads, keeps hold of this memory and hands it out to the offloading threads upon demand. We show that this user-managed, CPU-based memory administration helps us to overcome the GPU memory bookkeeping bottleneck and speeds up the time-to-solution of Finite Volume kernels by more than an order of magnitude.
KW - Dynamically adaptive mesh refinement
KW - GPU offloading
KW - Multithreading
KW - OpenMP
UR - http://www.scopus.com/inward/record.url?scp=85161250265&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-32041-5_4
DO - 10.1007/978-3-031-32041-5_4
M3 - Conference contribution
AN - SCOPUS:85161250265
SN - 9783031320408
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 65
EP - 85
BT - High Performance Computing - 38th International Conference, ISC High Performance 2023, Proceedings
A2 - Bhatele, Abhinav
A2 - Hammond, Jeff
A2 - Baboulin, Marc
A2 - Kruse, Carola
PB - Springer Science and Business Media Deutschland GmbH
T2 - 38th International Conference on High Performance Computing, ISC High Performance 2023
Y2 - 21 May 2023 through 25 May 2023
ER -