TY - GEN
T1 - Accurate application progress analysis for large-scale parallel debugging
AU - Mitra, Subrata
AU - Laguna, Ignacio
AU - Ahn, Dong H.
AU - Bagchi, Saurabh
AU - Schulz, Martin
AU - Gamblin, Todd
PY - 2014
Y1 - 2014
N2 - Debugging large-scale parallel applications is challenging. In most HPC applications, parallel tasks progress in a coordinated fashion, and thus a fault in one task can quickly propagate to other tasks, making it difficult to debug. Finding the least-progressed tasks can significantly reduce the effort to identify the task where the fault originated. However, existing approaches for detecting them suffer from low accuracy and large overheads; they either use imprecise static analysis or are unable to infer progress dependence inside loops. We present a loop-aware progress-dependence analysis tool, PRODOMETER, which determines relative progress among parallel tasks via dynamic analysis. Our fault-injection experiments suggest that its accuracy and precision are over 90% for most cases and that it scales well up to 16,384 MPI tasks. Further, our case study shows that it significantly helped diagnose a perplexing error in MPI, which only manifested at large scale.
AB - Debugging large-scale parallel applications is challenging. In most HPC applications, parallel tasks progress in a coordinated fashion, and thus a fault in one task can quickly propagate to other tasks, making it difficult to debug. Finding the least-progressed tasks can significantly reduce the effort to identify the task where the fault originated. However, existing approaches for detecting them suffer from low accuracy and large overheads; they either use imprecise static analysis or are unable to infer progress dependence inside loops. We present a loop-aware progress-dependence analysis tool, PRODOMETER, which determines relative progress among parallel tasks via dynamic analysis. Our fault-injection experiments suggest that its accuracy and precision are over 90% for most cases and that it scales well up to 16,384 MPI tasks. Further, our case study shows that it significantly helped diagnose a perplexing error in MPI, which only manifested at large scale.
KW - Dynamic analysis
KW - High-performance computing
KW - MPI
KW - Parallel debugging
UR - http://www.scopus.com/inward/record.url?scp=84901627203&partnerID=8YFLogxK
U2 - 10.1145/2594291.2594336
DO - 10.1145/2594291.2594336
M3 - Conference contribution
AN - SCOPUS:84901627203
SN - 9781450327848
T3 - Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI)
SP - 193
EP - 203
BT - PLDI 2014 - Proceedings of the 2014 ACM SIGPLAN Conference on Programming Language Design and Implementation
PB - Association for Computing Machinery
T2 - 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2014
Y2 - 9 June 2014 through 11 June 2014
ER -