Abstract
The user-level failure mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in the Message Passing Interface (MPI). Previous work has evaluated the performance of ULFM, yet questions about its programmability and applicability, especially for non-trivial, bulk synchronous applications, remain unanswered. In this article, we present our experience using ULFM in a case study with a large, highly scalable, bulk synchronous molecular dynamics application, to shed light on the advantages and difficulties of using this interface to program fault-tolerant MPI applications. We found that, although ULFM is suitable for master-worker applications, it provides few benefits for the more common bulk synchronous MPI applications. To address these limitations, we introduce a new, simpler fault-tolerant interface for complex, bulk synchronous MPI programs, with better applicability than ULFM and better support for application-level recovery mechanisms such as global rollback.
| Original language | English |
|---|---|
| Pages (from-to) | 305-319 |
| Number of pages | 15 |
| Journal | International Journal of High Performance Computing Applications |
| Volume | 30 |
| Issue number | 3 |
| State | Published - 1 Aug 2016 |
| Externally published | Yes |
Keywords
- MPI
- checkpointing
- failure recovery models
- fault tolerance
- molecular dynamics simulation
Fingerprint
Dive into the research topics of 'Evaluating and extending user-level fault tolerance in MPI applications'. Together they form a unique fingerprint.