TY - GEN
T1 - A Stencil Framework to Realize Large-Scale Computations beyond Device Memory Capacity on GPU Supercomputers
AU - Shimokawabe, Takashi
AU - Endo, Toshio
AU - Onodera, Naoyuki
AU - Aoki, Takayuki
N1 - Publisher Copyright: © 2017 IEEE.
PY - 2017/9/22
Y1 - 2017/9/22
N2 - Stencil-based applications such as CFD have succeeded in obtaining high performance on GPU supercomputers. The problem sizes of these applications are limited by the GPU device memory capacity, which is typically smaller than the host memory. On GPU supercomputers, a locality improvement technique that combines temporal blocking with memory swapping between host and device enables large computations beyond the device memory capacity. However, because the loop management of temporal blocking with data movement across these memories increases programming difficulty, applying this methodology to real stencil applications demands substantially higher programming cost. Our high-productivity stencil framework automatically applies temporal blocking to boundary exchange required for stencil computation and supports automatic memory swapping provided by an MPI/CUDA wrapper library. The framework-based application for airflow in an urban city maintains 80% performance even with a problem size twice as large as the GPU memory capacity and has demonstrated good weak scalability on the TSUBAME 2.5 supercomputer.
AB - Stencil-based applications such as CFD have succeeded in obtaining high performance on GPU supercomputers. The problem sizes of these applications are limited by the GPU device memory capacity, which is typically smaller than the host memory. On GPU supercomputers, a locality improvement technique that combines temporal blocking with memory swapping between host and device enables large computations beyond the device memory capacity. However, because the loop management of temporal blocking with data movement across these memories increases programming difficulty, applying this methodology to real stencil applications demands substantially higher programming cost. Our high-productivity stencil framework automatically applies temporal blocking to boundary exchange required for stencil computation and supports automatic memory swapping provided by an MPI/CUDA wrapper library. The framework-based application for airflow in an urban city maintains 80% performance even with a problem size twice as large as the GPU memory capacity and has demonstrated good weak scalability on the TSUBAME 2.5 supercomputer.
UR - http://www.scopus.com/inward/record.url?scp=85032619007&partnerID=8YFLogxK
U2 - 10.1109/CLUSTER.2017.97
DO - 10.1109/CLUSTER.2017.97
M3 - Conference contribution
AN - SCOPUS:85032619007
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
SP - 525
EP - 529
BT - Proceedings - 2017 IEEE International Conference on Cluster Computing, CLUSTER 2017
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2017 IEEE International Conference on Cluster Computing, CLUSTER 2017
Y2 - 5 September 2017 through 8 September 2017
ER -