iCheck: Leveraging RDMA and Malleability for Application-Level Checkpointing in HPC Systems

Jophin John, Isaac David Nunez Araya, Michael Gerndt

Publication: Contribution to book/report/conference proceedings; Conference contribution; Peer-reviewed

Abstract

The estimate that the mean time between failures in exascale supercomputers will be measured in minutes should alarm application developers. The systems' inherent complexity, millions of components, and susceptibility to failures make checkpointing more relevant than ever. Most high-performance scientific applications contain an in-house checkpoint/restart mechanism, and their performance can suffer from contention for parallel file system resources. A shift in checkpointing strategies is needed to counter this behavior. With iCheck, we present a novel checkpointing framework that supports malleable, multilevel, application-level checkpointing. We employ an RDMA-enabled, configurable, multi-agent-based checkpoint transfer mechanism that uses minimal application resources for checkpointing. The high-level API of iCheck facilitates easy integration and malleability. We have integrated the iCheck library into the ls1 mardyn application, yielding a performance improvement of up to five thousand times over the in-house checkpointing mechanism. LULESH, a Jacobi 2D heat simulation, and a synthetic application were also used for extensive analysis.

Original language: English
Title: Proceedings - 2022 IEEE 28th International Conference on Parallel and Distributed Systems, ICPADS 2022
Publisher: IEEE Computer Society
Pages: 467-474
Number of pages: 8
ISBN (electronic): 9781665473156
Publication status: Published - 2023
Event: 28th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2022 - Nanjing, China
Duration: 10 Jan 2023 to 12 Jan 2023

Publication series

Name: Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
Volume: 2023-January
ISSN (Print): 1521-9097

Conference

Conference: 28th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2022
Country/Territory: China
City: Nanjing
Period: 10/01/23 to 12/01/23
