iCheck: Leveraging RDMA and Malleability for Application-Level Checkpointing in HPC Systems

Jophin John, Isaac David Nunez Araya, Michael Gerndt

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The estimate that the mean time between failures will be in minutes in exascale supercomputers should be alarming for application developers. The system's inherent complexity, millions of components, and susceptibility to failures make checkpointing more relevant than ever. Since most high-performance scientific applications contain an in-house checkpoint-restart mechanism, their performance can be impacted by contention for parallel file system resources. A shift in checkpointing strategies is needed to thwart this behavior. With iCheck, we present a novel checkpointing framework that supports malleable multilevel application-level checkpointing. We employ an RDMA-enabled, configurable, multi-agent-based checkpoint transfer mechanism in which minimal application resources are utilized for checkpointing. The high-level API of iCheck facilitates easy integration and malleability. We have added the iCheck library into the ls1 mardyn application, providing a performance improvement of up to five thousand times over the in-house checkpointing mechanism. LULESH, Jacobi 2D heat simulation, and a synthetic application were also used for extensive analysis.

Original language: English
Title of host publication: Proceedings - 2022 IEEE 28th International Conference on Parallel and Distributed Systems, ICPADS 2022
Publisher: IEEE Computer Society
Pages: 467-474
Number of pages: 8
ISBN (Electronic): 9781665473156
DOIs
State: Published - 2023
Event: 28th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2022 - Nanjing, China
Duration: 10 Jan 2023 → 12 Jan 2023

Publication series

Name: Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
Volume: 2023-January
ISSN (Print): 1521-9097

Conference

Conference: 28th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2022
Country/Territory: China
City: Nanjing
Period: 10/01/23 → 12/01/23

Keywords

  • Adaptive Checkpointing
  • Fault Tolerance
  • MPI
  • Malleable Checkpointing
  • RDMA
