Application-level checkpointing for shared memory programs

Greg Bronevetsky, Daniel Marques, Keshav Pingali, Peter Szwed, Martin Schulz

Publication: Contribution to book/report/conference proceedings, conference contribution, peer-reviewed

56 citations (Scopus)

Abstract

Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR. Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs. In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. This system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks. One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduce overheads further.
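The basic CPR idea described in the abstract (save state periodically, restart from the last saved state) can be illustrated with a minimal single-threaded sketch. This is not the authors' pre-compiler or runtime protocol, and it does not model coordination among OpenMP threads; the function names, the JSON file format, and the `fail_at` parameter used to simulate a fault are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint.

    The rename is atomic, so a crash mid-write leaves the previous
    checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    fail_at is a hypothetical knob that simulates a hardware fault
    by raising an exception at the given step.
    """
    # Resume from the last checkpoint if one exists, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated fault")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

In this sketch, a run that fails at step 57 with a checkpoint interval of 10 leaves a checkpoint at step 50; calling `run` again with the same path resumes from there and repeats only the work done after that checkpoint.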

Original language: English
Title: 11th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XI
Publisher: Association for Computing Machinery
Pages: 235-247
Number of pages: 13
ISBN (Print): 1581138040, 9781581138047
Publication status: Published - 2004
Externally published: Yes
Event: 11th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XI - Boston, MA, United States
Duration: 9 Oct 2004 to 13 Oct 2004

Publication series

Name: 11th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XI

Conference

Conference: 11th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XI
Country/Territory: United States
City: Boston, MA
Period: 9 Oct 2004 to 13 Oct 2004
