Accurate application progress analysis for large-scale parallel debugging

Subrata Mitra, Ignacio Laguna, Dong H. Ahn, Saurabh Bagchi, Martin Schulz, Todd Gamblin

Publication: Contribution to journal, Article, Peer-reviewed

9 citations (Scopus)

Abstract

Debugging large-scale parallel applications is challenging. In most HPC applications, parallel tasks progress in a coordinated fashion, so a fault in one task can quickly propagate to other tasks, making the application difficult to debug. Finding the least-progressed tasks can significantly reduce the effort needed to identify the task where the fault originated. However, existing approaches for detecting them suffer from low accuracy and large overheads; they either use imprecise static analysis or are unable to infer progress dependence inside loops. We present a loop-aware progress-dependence analysis tool, PRODOMETER, which determines relative progress among parallel tasks via dynamic analysis. Our fault-injection experiments suggest that its accuracy and precision are over 90% for most cases and that it scales well up to 16,384 MPI tasks. Further, our case study shows that it significantly helped in diagnosing a perplexing error in MPI, which only manifested at large scale.
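
To illustrate why loop awareness matters when ranking tasks by progress, the following is a minimal sketch and not PRODOMETER's actual algorithm or data model. It assumes each task reports a hypothetical record `(loop_iteration, position_in_loop_body)` and that records can be compared lexicographically; the function name `least_progressed` and the sample snapshot are illustrative assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical progress record (an assumption for illustration only):
# (iteration count of the enclosing loop, position reached inside the loop body).
Progress = Tuple[int, int]


def least_progressed(progress: Dict[int, Progress]) -> List[int]:
    """Return the task ranks whose record is smallest in lexicographic order."""
    if not progress:
        return []
    lowest = min(progress.values())
    return sorted(rank for rank, record in progress.items() if record == lowest)


if __name__ == "__main__":
    # Four hypothetical MPI ranks. Rank 2 sits at a later position in the loop
    # body but is one iteration behind, so it is the least-progressed task.
    snapshot = {0: (120, 1), 1: (120, 1), 2: (119, 3), 3: (120, 2)}
    print(least_progressed(snapshot))
```

Comparing only the position inside the loop body would rank tasks 0 and 1 as least progressed in this snapshot; taking the iteration count into account singles out task 2 instead.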

Original language: English
Pages (from - to): 193-203
Number of pages: 11
Journal: ACM SIGPLAN Notices
Volume: 49
Issue number: 6
Publication status: Published - 5 June 2014
Externally published: Yes
