Abstract
Debugging large-scale parallel applications is challenging. In most HPC applications, parallel tasks progress in a coordinated fashion, so a fault in one task can quickly propagate to other tasks, making the application difficult to debug. Finding the least-progressed tasks can significantly reduce the effort needed to identify the task where the fault originated. However, existing approaches for detecting them suffer from low accuracy and large overheads: they either rely on imprecise static analysis or cannot infer progress dependence inside loops. We present PRODOMETER, a loop-aware progress-dependence analysis tool that determines relative progress among parallel tasks via dynamic analysis. Our fault-injection experiments suggest that its accuracy and precision are over 90% for most cases and that it scales well up to 16,384 MPI tasks. Further, our case study shows that it significantly helped in diagnosing a perplexing error in MPI that manifested only at large scale.
| Original language | English |
|---|---|
| Pages (from - to) | 193-203 |
| Number of pages | 11 |
| Journal | ACM SIGPLAN Notices |
| Volume | 49 |
| Issue number | 6 |
| DOIs | |
| Publication status | Published - 5 June 2014 |
| Externally published | Yes |