MUST: A scalable approach to runtime error detection in MPI programs

Tobias Hilbrich, Martin Schulz, Bronis R. De Supinski, Matthias S. Müller

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

52 Scopus citations

Abstract

The Message-Passing Interface (MPI) is large and complex. Therefore, programming MPI is error prone. Several MPI runtime correctness tools address classes of usage errors, such as deadlocks or non-portable constructs. To our knowledge none of these tools scales to more than about 100 processes. However, some of the current HPC systems use more than 100,000 cores and future systems are expected to use far more. Since errors often depend on the task count used, we need correctness tools that scale to the full system size. We present a novel framework for scalable MPI correctness tools to address this need. Our fine-grained, module-based approach supports rapid prototyping and allows correctness tools built upon it to adapt to different architectures and use cases. The design uses PnMPI to instantiate a tool from a set of individual modules. We present an overview of our design, along with first performance results for a proof of concept implementation.

Original language: English
Title of host publication: Proceedings of the 3rd International Workshop on Parallel Tools for High Performance Computing 2009
Publisher: Springer Verlag
Pages: 53-66
Number of pages: 14
ISBN (Print): 9783642112607
State: Published - 2010
Externally published: Yes
Event: 3rd International Workshop on Parallel Tools for High Performance Computing, HPC 2009 - Dresden, Germany
Duration: 14 Sep 2009 → 15 Sep 2009

Publication series

Name: Proceedings of the 3rd International Workshop on Parallel Tools for High Performance Computing 2009

Conference

Conference: 3rd International Workshop on Parallel Tools for High Performance Computing, HPC 2009
Country/Territory: Germany
City: Dresden
Period: 14/09/09 → 15/09/09
