TY - GEN
T1 - Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs
AU - Schulz, Martin
AU - Bronevetsky, Greg
AU - Fernandes, Rohit
AU - Marques, Daniel
AU - Pingali, Keshav
AU - Stodghill, Paul
N1 - Publisher Copyright: © 2004 IEEE.
PY - 2004
Y1 - 2004
AB - The running times of many computational science applications are much longer than the mean-time-to-failure of current high-performance computing platforms. To run to completion, such applications must tolerate hardware failures. Checkpoint-and-restart (CPR) is the most commonly used scheme for accomplishing this - the state of the computation is saved periodically on stable storage, and when a hardware failure is detected, the computation is restarted from the most recently saved state. Most automatic CPR schemes in the literature can be classified as system-level checkpointing schemes because they take core-dump style snapshots of the computational state when all the processes are blocked at global barriers in the program. Unfortunately, a system that implements this style of checkpointing is tied to a particular platform; in addition, it cannot be used if there are no global barriers in the program. We are exploring an alternative called application-level, non-blocking checkpointing. In our approach, programs are transformed by a pre-processor so that they become self-checkpointing and self-restartable on any platform; there is also no assumption about the existence of global barriers in the code. In this paper, we describe our implementation of application-level, non-blocking checkpointing. We present experimental results on both a Windows cluster and a Compaq Alpha cluster, which show that the overheads introduced by our approach are small.
UR - http://www.scopus.com/inward/record.url?scp=84934312471&partnerID=8YFLogxK
U2 - 10.1109/SC.2004.29
DO - 10.1109/SC.2004.29
M3 - Conference contribution
AN - SCOPUS:84934312471
T3 - Proceedings of the ACM/IEEE SC 2004 Conference: Bridging Communities
BT - Proceedings of the ACM/IEEE SC 2004 Conference
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2004 ACM/IEEE Conference on Supercomputing, SC 2004
Y2 - 6 November 2004 through 12 November 2004
ER -