TY - JOUR
T1 - Graph-based multi-core higher-order time integration of linear autonomous partial differential equations
AU - Huber, Dominik
AU - Schreiber, Martin
AU - Schulz, Martin
N1 - Publisher Copyright:
© 2021 Elsevier B.V.
PY - 2021/7
Y1 - 2021/7
N2 - Modern high-performance computing (HPC) systems rely on increasingly complex nodes with a steadily growing number of cores and matching deep memory hierarchies. To fully exploit these systems, algorithms must be explicitly designed around such features. In this work we address this challenge for a widely used class of application kernels: polynomial-based time integration of linear autonomous partial differential equations. We build on prior work [1] on a cache-aware, yet sequential solution and provide an innovative way to parallelize it while preserving cache-awareness across a large number of cores. To this end, we introduce a dependency-graph-driven view of the algorithm and then use both static graph partitioning and dynamic scheduling to efficiently map the execution onto the underlying platform. We implement our approach on top of the widely available Intel Threading Building Blocks (TBB) library, although the concepts are programming-model agnostic and apply to any task-driven parallel programming approach. We demonstrate the performance of our approach for 2nd-, 4th- and 6th-order time integration of the linear advection equation on three architectures with widely varying memory systems, achieving up to a 60% reduction in wall-clock time compared to a conventional, state-of-the-art non-cache-aware approach.
AB - Modern high-performance computing (HPC) systems rely on increasingly complex nodes with a steadily growing number of cores and matching deep memory hierarchies. To fully exploit these systems, algorithms must be explicitly designed around such features. In this work we address this challenge for a widely used class of application kernels: polynomial-based time integration of linear autonomous partial differential equations. We build on prior work [1] on a cache-aware, yet sequential solution and provide an innovative way to parallelize it while preserving cache-awareness across a large number of cores. To this end, we introduce a dependency-graph-driven view of the algorithm and then use both static graph partitioning and dynamic scheduling to efficiently map the execution onto the underlying platform. We implement our approach on top of the widely available Intel Threading Building Blocks (TBB) library, although the concepts are programming-model agnostic and apply to any task-driven parallel programming approach. We demonstrate the performance of our approach for 2nd-, 4th- and 6th-order time integration of the linear advection equation on three architectures with widely varying memory systems, achieving up to a 60% reduction in wall-clock time compared to a conventional, state-of-the-art non-cache-aware approach.
KW - Cache-blocking in time dimension
KW - Higher-order time integration
KW - Matrix exponentiation
KW - Task-based parallelization
UR - http://www.scopus.com/inward/record.url?scp=85107021104&partnerID=8YFLogxK
U2 - 10.1016/j.jocs.2021.101349
DO - 10.1016/j.jocs.2021.101349
M3 - Article
AN - SCOPUS:85107021104
SN - 1877-7503
VL - 53
JO - Journal of Computational Science
JF - Journal of Computational Science
M1 - 101349
ER -