Accurate application progress analysis for large-scale parallel debugging

Subrata Mitra, Ignacio Laguna, Dong H. Ahn, Saurabh Bagchi, Martin Schulz, Todd Gamblin

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

19 Scopus citations

Abstract

Debugging large-scale parallel applications is challenging. In most HPC applications, parallel tasks progress in a coordinated fashion, so a fault in one task can quickly propagate to other tasks, making the application difficult to debug. Finding the least-progressed tasks can significantly reduce the effort needed to identify the task where the fault originated. However, existing approaches for detecting them suffer from low accuracy and large overheads: they either use imprecise static analysis or are unable to infer progress dependence inside loops. We present a loop-aware progress-dependence analysis tool, PRODOMETER, which determines relative progress among parallel tasks via dynamic analysis. Our fault-injection experiments suggest that its accuracy and precision are over 90% for most cases and that it scales well up to 16,384 MPI tasks. Further, our case study shows that it helped significantly in diagnosing a perplexing error in MPI that manifested only at large scale.

Original language: English
Title of host publication: PLDI 2014 - Proceedings of the 2014 ACM SIGPLAN Conference on Programming Language Design and Implementation
Publisher: Association for Computing Machinery
Pages: 193-203
Number of pages: 11
ISBN (Print): 9781450327848
DOIs
State: Published - 2014
Externally published: Yes
Event: 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2014 - Edinburgh, United Kingdom
Duration: 9 Jun 2014 to 11 Jun 2014

Publication series

Name: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI)

Conference

Conference: 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2014
Country/Territory: United Kingdom
City: Edinburgh
Period: 9/06/14 to 11/06/14

Keywords

  • Dynamic analysis
  • High-performance computing
  • MPI
  • Parallel debugging
