MPI Runtime Error Detection with MUST: A Scalable and Crash-Safe Approach

Joachim Protze, Tobias Hilbrich, Martin Schulz, Bronis R. De Supinski, Wolfgang E. Nagel, Matthias S. Müller

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

6 Scopus citations

Abstract

The Message Passing Interface (MPI) is a widely used paradigm for distributed memory programming. Implementations of this interface are designed for good performance rather than for usability extensions that enforce their correct use. Runtime MPI usage error detection tools aid application developers in the correct use of this interface. Since usage errors can cause failures that lead to an application crash, it is crucial that runtime error detection tools employ techniques that allow them to finish all of their correctness checks. This includes situations in which the application is interrupted by the MPI library, due to an incorrect function call, and by operating system signals after fatal errors like division by zero or faulty memory accesses. We present an approach that uses an alternative tool communication means along with signal and error handling capabilities. A study of the assumptions that enable this approach details its applicability for different use cases and compares it to less efficient schemes that rely on synchronous processing and/or communication. Additionally, we enable bandwidth-efficient communication with a scalable propagation technique that raises the awareness of an application crash within the tool. An application study with the SPEC MPI2007 benchmark suite demonstrates the applicability of our approach for up to 2,048 processes. Overhead measurements underline that our application crash handling increases the runtime of our runtime error detection tool by only 4% on average.

Original language: English
Title of host publication: Proceedings - 43rd International Conference on Parallel Processing Workshops, ICPPW 2014
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 206-215
Number of pages: 10
ISBN (Electronic): 9781479956159
State: Published - 7 May 2015
Externally published: Yes
Event: 43rd International Conference on Parallel Processing Workshops, ICPPW 2014 - Minneapolis, United States
Duration: 9 Sep 2014 to 12 Sep 2014

Publication series

Name: Proceedings of the International Conference on Parallel Processing Workshops
Volume: 2015-May
ISSN (Print): 1530-2016

Conference

Conference: 43rd International Conference on Parallel Processing Workshops, ICPPW 2014
Country/Territory: United States
City: Minneapolis
Period: 9/09/14 to 12/09/14

Keywords

  • MPI
  • crash safe
  • debugging
  • detection
