TY - GEN
T1 - Predicting faults in high performance computing systems
T2 - 2019 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2019
AU - Jauk, David
AU - Yang, Dai
AU - Schulz, Martin
N1 - Publisher Copyright:
© 2019 ACM.
PY - 2019/11/17
Y1 - 2019/11/17
N2 - As we near exascale, resilience remains a major technical hurdle. Any technique with the goal of achieving resilience suffers from having to be reactive, as failures can appear at any time. A wide body of research aims at predicting failures, i.e., forecasting failures so that evasive actions can be taken while the system is still fully functional, which has the benefit of giving insight into the global system state. This research area has grown very diverse with a large number of approaches, yet is currently poorly classified, making it hard to understand the impact and coverage of existing work. In this paper, we perform an extensive survey of existing literature in failure prediction by analyzing and comparing more than 30 different failure prediction approaches. We develop a taxonomy, which aids in categorizing the methods, and we show how this can help us to understand the state-of-the-practice of this field and to identify opportunities, gaps as well as future work.
AB - As we near exascale, resilience remains a major technical hurdle. Any technique with the goal of achieving resilience suffers from having to be reactive, as failures can appear at any time. A wide body of research aims at predicting failures, i.e., forecasting failures so that evasive actions can be taken while the system is still fully functional, which has the benefit of giving insight into the global system state. This research area has grown very diverse with a large number of approaches, yet is currently poorly classified, making it hard to understand the impact and coverage of existing work. In this paper, we perform an extensive survey of existing literature in failure prediction by analyzing and comparing more than 30 different failure prediction approaches. We develop a taxonomy, which aids in categorizing the methods, and we show how this can help us to understand the state-of-the-practice of this field and to identify opportunities, gaps as well as future work.
KW - Exascale computing
KW - Fault prediction
KW - High performance computing
KW - Resillience
UR - http://www.scopus.com/inward/record.url?scp=85076254771&partnerID=8YFLogxK
U2 - 10.1145/3295500.3356185
DO - 10.1145/3295500.3356185
M3 - Conference contribution
AN - SCOPUS:85076254771
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - Proceedings of SC 2019
PB - IEEE Computer Society
Y2 - 17 November 2019 through 22 November 2019
ER -