TY - GEN
T1 - Reducing False Node Failure Predictions in HPC
AU - Frank, Alvaro
AU - Yang, Dai
AU - Brinkmann, Andre
AU - Schulz, Martin
AU - Suss, Tim
N1 - Publisher Copyright: © 2019 IEEE.
PY - 2019/12
Y1 - 2019/12
AB - Future HPC applications must be able to scale to thousands of compute nodes while running for several days. The increased runtime and node count raise the probability of hardware failures that may interrupt computations. Scientists must therefore protect their simulations against hardware failures. This is typically done using frequent checkpoint & restart, which may incur significant overhead. Consequently, the frequency at which checkpoints are taken should be minimized. Predicting hardware failures ahead of time is a promising approach to this problem, but issues remain, such as false alarms at large scale. In this paper, we introduce the probability of unnecessarily triggering checkpoints (UC) to evaluate the quality of node-level failure predictors for checkpointing large-scale applications. This metric is used to show how current predictors suffer from too many false alarms at large node counts. Further, we propose a new failure predictor that chains several machine learning classifiers to make predictions with minimal false alarms. We aim for extremely low false positive rates to guarantee that no unnecessary checkpoints will be performed even for very large node counts. Our experiments, based on real system traces from a large production cluster, show that our predictor achieves a lead-up time of four minutes, a recall of 0.7302, a false positive rate of 0.0004, a precision of 0.9944, and a probability of unnecessary checkpoints (UC) of 0.00011 for 1024 nodes.
KW - failure prediction
KW - false positives
KW - fault tolerance
KW - resilience
UR - http://www.scopus.com/inward/record.url?scp=85080142773&partnerID=8YFLogxK
U2 - 10.1109/HiPC.2019.00047
DO - 10.1109/HiPC.2019.00047
M3 - Conference contribution
AN - SCOPUS:85080142773
T3 - Proceedings - 26th IEEE International Conference on High Performance Computing, HiPC 2019
SP - 323
EP - 332
BT - Proceedings - 26th IEEE International Conference on High Performance Computing, HiPC 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 26th Annual IEEE International Conference on High Performance Computing, HiPC 2019
Y2 - 17 December 2019 through 20 December 2019
ER -