TY - JOUR
T1 - Evaluating and extending user-level fault tolerance in MPI applications
AU - Laguna, Ignacio
AU - Richards, David F.
AU - Gamblin, Todd
AU - Schulz, Martin
AU - De Supinski, Bronis R.
AU - Mohror, Kathryn
AU - Pritchard, Howard
N1 - Publisher Copyright: © SAGE Publications.
PY - 2016/8/1
Y1 - 2016/8/1
N2 - The user-level failure mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in the Message Passing Interface (MPI). Previous work presented performance evaluations of ULFM, yet questions related to its programmability and applicability, especially to non-trivial, bulk synchronous applications, remain unanswered. In this article, we present our experience using ULFM in a case study with a large, highly scalable, bulk synchronous molecular dynamics application to shed light on the advantages and difficulties of using this interface to program fault-tolerant MPI applications. We found that, although ULFM is suitable for master-worker applications, it provides few benefits for more common bulk synchronous MPI applications. To address these limitations, we introduce a new, simpler fault-tolerant interface for complex, bulk synchronous MPI programs with better applicability and support than ULFM for application-level recovery mechanisms, such as global rollback.
AB - The user-level failure mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in the Message Passing Interface (MPI). Previous work presented performance evaluations of ULFM, yet questions related to its programmability and applicability, especially to non-trivial, bulk synchronous applications, remain unanswered. In this article, we present our experience using ULFM in a case study with a large, highly scalable, bulk synchronous molecular dynamics application to shed light on the advantages and difficulties of using this interface to program fault-tolerant MPI applications. We found that, although ULFM is suitable for master-worker applications, it provides few benefits for more common bulk synchronous MPI applications. To address these limitations, we introduce a new, simpler fault-tolerant interface for complex, bulk synchronous MPI programs with better applicability and support than ULFM for application-level recovery mechanisms, such as global rollback.
KW - MPI
KW - checkpointing
KW - failure recovery models
KW - fault tolerance
KW - molecular dynamics simulation
UR - http://www.scopus.com/inward/record.url?scp=84983438495&partnerID=8YFLogxK
U2 - 10.1177/1094342015623623
DO - 10.1177/1094342015623623
M3 - Article
AN - SCOPUS:84983438495
SN - 1094-3420
VL - 30
SP - 305
EP - 319
JO - International Journal of High Performance Computing Applications
JF - International Journal of High Performance Computing Applications
IS - 3
ER -