Abstract
Debugging large-scale parallel applications is challenging. In most HPC applications, parallel tasks progress in a coordinated fashion, so a fault in one task can quickly propagate to other tasks, making it difficult to debug. Finding the least-progressed tasks can significantly reduce the effort needed to identify the task where the fault originated. However, existing approaches for detecting them suffer from low accuracy and large overheads: they either use imprecise static analysis or are unable to infer progress dependence inside loops. We present a loop-aware progress-dependence analysis tool, PRODOMETER, which determines relative progress among parallel tasks via dynamic analysis. Our fault-injection experiments suggest that its accuracy and precision are over 90% for most cases and that it scales well up to 16,384 MPI tasks. Further, our case study shows that it significantly helped diagnose a perplexing error in MPI, which only manifested at large scale.
Original language | English |
---|---|
Pages (from - to) | 193-203 |
Number of pages | 11 |
Journal | ACM SIGPLAN Notices |
Volume | 49 |
Issue number | 6 |
DOIs | |
Publication status | Published - 5 June 2014 |
Externally published | Yes |