Evaluating and extending user-level fault tolerance in MPI applications

Ignacio Laguna, David F. Richards, Todd Gamblin, Martin Schulz, Bronis R. De Supinski, Kathryn Mohror, Howard Pritchard

Research output: Contribution to journal › Article › peer-review

37 Scopus citations

Abstract

The user-level failure mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in the Message Passing Interface (MPI). Previous work presented performance evaluations of ULFM; yet questions related to its programmability and applicability, especially to non-trivial, bulk synchronous applications, remain unanswered. In this article, we present our experience using ULFM in a case study with a large, highly scalable, bulk synchronous molecular dynamics application to shed light on the advantages and difficulties of using this interface to program fault-tolerant MPI applications. We found that, although ULFM is suitable for master-worker applications, it provides few benefits for more common bulk synchronous MPI applications. To address these limitations, we introduce a new, simpler fault-tolerant interface for complex, bulk synchronous MPI programs with better applicability and support than ULFM for application-level recovery mechanisms, such as global rollback.

Original language: English
Pages (from-to): 305-319
Number of pages: 15
Journal: International Journal of High Performance Computing Applications
Volume: 30
Issue number: 3
State: Published - 1 Aug 2016
Externally published: Yes

Keywords

  • MPI
  • checkpointing
  • failure recovery models
  • fault tolerance
  • molecular dynamics simulation
