TY - GEN
T1 - FlipTracker
T2 - 2018 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
AU - Guo, Luanzheng
AU - Li, Dong
AU - Laguna, Ignacio
AU - Schulz, Martin
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/7/2
Y1 - 2018/7/2
N2 - As high-performance computing systems scale in size and computational power, the danger of silent errors, i.e., errors that can bypass hardware detection mechanisms and impact application state, grows dramatically. Consequently, applications running on HPC systems need to exhibit resilience to such errors. Previous work has found that, for certain codes, this resilience can come for free, i.e., some applications are naturally resilient, but few studies have shown the code patterns - combinations or sequences of computations - that make an application naturally resilient. In this paper, we present FlipTracker, a framework designed to extract these patterns using fine-grained tracking of error propagation and resilience properties, and we use it to present a set of computation patterns that are responsible for making representative HPC applications naturally resilient to errors. This not only enables a deeper understanding of resilience properties of these codes, but also can guide future application designs towards patterns with natural resilience.
KW - Fault tolerance
KW - High-Performance Computing
KW - Natural Resilience
KW - Resilience computation patterns
UR - http://www.scopus.com/inward/record.url?scp=85064114291&partnerID=8YFLogxK
U2 - 10.1109/SC.2018.00011
DO - 10.1109/SC.2018.00011
M3 - Conference contribution
AN - SCOPUS:85064114291
T3 - Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
SP - 94
EP - 107
BT - Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 11 November 2018 through 16 November 2018
ER -