TY - GEN
T1 - Evaluating user-level fault tolerance for MPI applications
AU - Laguna, Ignacio
AU - Richards, David F.
AU - Gamblin, Todd
AU - Schulz, Martin
AU - de Supinski, Bronis R.
N1 - Publisher Copyright: © ACM 2014.
PY - 2014/9/9
Y1 - 2014/9/9
N2 - The User Level Failure Mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in MPI. Previous work has presented performance evaluations of the interface; yet questions related to its programmability and applicability remain unanswered. In this paper, we present our experiences on using ULFM in a case study (a large molecular dynamics application) to shed light on the advantages and difficulties of this interface to program fault-tolerant MPI applications. We found that, although ULFM is suitable for applications with work-decomposition flexibility (e.g., master-slave), it provides few benefits for more general (e.g., bulk synchronous) MPI applications.
AB - The User Level Failure Mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in MPI. Previous work has presented performance evaluations of the interface; yet questions related to its programmability and applicability remain unanswered. In this paper, we present our experiences on using ULFM in a case study (a large molecular dynamics application) to shed light on the advantages and difficulties of this interface to program fault-tolerant MPI applications. We found that, although ULFM is suitable for applications with work-decomposition flexibility (e.g., master-slave), it provides few benefits for more general (e.g., bulk synchronous) MPI applications.
KW - Failure recovery models
KW - Fault tolerance
KW - MPI
KW - Molecular dynamics simulation
UR - http://www.scopus.com/inward/record.url?scp=84958973677&partnerID=8YFLogxK
U2 - 10.1145/2642769.2642775
DO - 10.1145/2642769.2642775
M3 - Conference contribution
AN - SCOPUS:84958973677
T3 - ACM International Conference Proceeding Series
SP - 57
EP - 62
BT - Proceedings of the 21st European MPI Users' Group Meeting, EuroMPI/ASIA 2014
PB - Association for Computing Machinery
T2 - 21st European MPI Users' Group Meeting, EuroMPI/ASIA 2014
Y2 - 9 September 2014 through 12 September 2014
ER -