Evaluating user-level fault tolerance for MPI applications

Ignacio Laguna, David F. Richards, Todd Gamblin, Martin Schulz, Bronis R. De Supinski

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

28 Scopus citations

Abstract

The User Level Failure Mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in MPI. Previous work has presented performance evaluations of the interface, yet questions about its programmability and applicability remain unanswered. In this paper, we present our experience using ULFM in a case study (a large molecular dynamics application) to shed light on the advantages and difficulties of using this interface to program fault-tolerant MPI applications. We found that, although ULFM is suitable for applications with work-decomposition flexibility (e.g., master-slave), it provides few benefits for more general (e.g., bulk synchronous) MPI applications.

Original language: English
Title of host publication: Proceedings of the 21st European MPI Users' Group Meeting, EuroMPI/ASIA 2014
Publisher: Association for Computing Machinery
Pages: 57-62
Number of pages: 6
ISBN (Electronic): 9781450328753
State: Published - 9 Sep 2014
Externally published: Yes
Event: 21st European MPI Users' Group Meeting, EuroMPI/ASIA 2014 - Kyoto, Japan
Duration: 9 Sep 2014 to 12 Sep 2014

Publication series

Name: ACM International Conference Proceeding Series
Volume: 09-12-September-2014

Conference

Conference: 21st European MPI Users' Group Meeting, EuroMPI/ASIA 2014
Country/Territory: Japan
City: Kyoto
Period: 9/09/14 to 12/09/14

Keywords

  • Failure recovery models
  • Fault tolerance
  • MPI
  • Molecular dynamics simulation
