TY - GEN
T1 - iCheck
T2 - 28th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2022
AU - John, Jophin
AU - Araya, Isaac David Nunez
AU - Gerndt, Michael
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - The estimate that the mean time between failures will be in minutes in exascale supercomputers should be alarming for application developers. The inherent complexity of these systems, their millions of components, and their susceptibility to failures make checkpointing more relevant than ever. Since most high-performance scientific applications contain an in-house checkpoint/restart mechanism, their performance can be impacted by contention for parallel file system resources. A shift in checkpointing strategies is needed to thwart this behavior. With iCheck, we present a novel checkpointing framework that supports malleable multilevel application-level checkpointing. We employ an RDMA-enabled, configurable, multi-agent-based checkpoint transfer mechanism in which minimal application resources are utilized for checkpointing. The high-level API of iCheck facilitates easy integration and malleability. We have added the iCheck library into the ls1 mardyn application, providing performance improvement of up to five thousand times over the in-house checkpointing mechanism. LULESH, Jacobi 2D heat simulation, and a synthetic application were also used for extensive analysis.
AB - The estimate that the mean time between failures will be in minutes in exascale supercomputers should be alarming for application developers. The inherent complexity of these systems, their millions of components, and their susceptibility to failures make checkpointing more relevant than ever. Since most high-performance scientific applications contain an in-house checkpoint/restart mechanism, their performance can be impacted by contention for parallel file system resources. A shift in checkpointing strategies is needed to thwart this behavior. With iCheck, we present a novel checkpointing framework that supports malleable multilevel application-level checkpointing. We employ an RDMA-enabled, configurable, multi-agent-based checkpoint transfer mechanism in which minimal application resources are utilized for checkpointing. The high-level API of iCheck facilitates easy integration and malleability. We have added the iCheck library into the ls1 mardyn application, providing performance improvement of up to five thousand times over the in-house checkpointing mechanism. LULESH, Jacobi 2D heat simulation, and a synthetic application were also used for extensive analysis.
KW - Adaptive Checkpointing
KW - Fault Tolerance
KW - MPI
KW - Malleable Checkpointing
KW - RDMA
UR - http://www.scopus.com/inward/record.url?scp=85152957241&partnerID=8YFLogxK
U2 - 10.1109/ICPADS56603.2022.00067
DO - 10.1109/ICPADS56603.2022.00067
M3 - Conference contribution
AN - SCOPUS:85152957241
T3 - Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
SP - 467
EP - 474
BT - Proceedings - 2022 IEEE 28th International Conference on Parallel and Distributed Systems, ICPADS 2022
PB - IEEE Computer Society
Y2 - 10 January 2023 through 12 January 2023
ER -