Implementation and evaluation of a scalable application-level checkpoint-recovery scheme for MPI programs

Martin Schulz, Greg Bronevetsky, Rohit Fernandes, Daniel Marques, Keshav Pingali, Paul Stodghill

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

17 Scopus citations

Abstract

The running times of many computational science applications are much longer than the mean-time-to-failure of current high-performance computing platforms. To run to completion, such applications must tolerate hardware failures. Checkpoint-and-restart (CPR) is the most commonly used scheme for accomplishing this: the state of the computation is saved periodically on stable storage, and when a hardware failure is detected, the computation is restarted from the most recently saved state. Most automatic CPR schemes in the literature can be classified as system-level checkpointing schemes because they take core-dump style snapshots of the computational state when all the processes are blocked at global barriers in the program. Unfortunately, a system that implements this style of checkpointing is tied to a particular platform; in addition, it cannot be used if there are no global barriers in the program. We are exploring an alternative called application-level, non-blocking checkpointing. In our approach, programs are transformed by a pre-processor so that they become self-checkpointing and self-restartable on any platform; there is also no assumption about the existence of global barriers in the code. In this paper, we describe our implementation of application-level, non-blocking checkpointing. We present experimental results on both a Windows cluster and a Compaq Alpha cluster, which show that the overheads introduced by our approach are small.
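The general checkpoint-and-restart idea described in the abstract can be illustrated with a minimal single-process sketch. This is not the authors' pre-processor or their MPI protocol; it only shows a program that saves its own state periodically and resumes from the most recent saved state after a failure. All names here (`save_checkpoint`, `load_checkpoint`, `run`, the `fail_at` parameter, and the pickle-based file format) are hypothetical choices made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it over path.

    The rename keeps the previous checkpoint intact if a failure
    happens while the new one is being written.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the most recently saved state, or None if there is none."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    fail_at is a hypothetical knob that simulates a hardware failure
    just before the given step is processed.
    """
    # Restart from the saved state if one exists, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated hardware failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

In this sketch the application itself decides what constitutes its state (a step counter and an accumulator), which is the sense in which the checkpoint is application-level rather than a core-dump style snapshot. Coordinating checkpoints across MPI processes without global barriers, which is the subject of the paper, is outside the scope of this example.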

Original language: English
Title of host publication: IEEE/ACM SC2004 Conference - Bridging Communities, Proceedings
Pages: 573-586
Number of pages: 14
State: Published - 2004
Externally published: Yes
Event: IEEE/ACM SC2004 Conference - Bridging Communities - Pittsburgh, PA, United States
Duration: 6 Nov 2004 to 12 Nov 2004

Publication series

Name: IEEE/ACM SC2004 Conference, Proceedings

Conference

Conference: IEEE/ACM SC2004 Conference - Bridging Communities
Country/Territory: United States
City: Pittsburgh, PA
Period: 6/11/04 to 12/11/04
