Reducing False Node Failure Predictions in HPC

Alvaro Frank, Dai Yang, Andre Brinkmann, Martin Schulz, Tim Suss

Publication: Chapter in book/report/conference proceeding › Conference contribution › Peer-reviewed

9 citations (Scopus)

Abstract

Future HPC applications must be able to scale to thousands of compute nodes while running for several days. The increased runtime and node count raise the probability of hardware failures that may interrupt computations. Scientists must therefore protect their simulations against hardware failures. This is typically done using frequent checkpoint and restart, which may have significant overheads. Consequently, the frequency at which checkpoints are taken should be minimized. Predicting hardware failures ahead of time is a promising approach to address this problem, but issues remain, such as false alarms at large scales. In this paper, we introduce the probability of unnecessarily triggering checkpoints (UC) to evaluate the quality of node-level failure predictors for checkpointing large-scale applications. This metric is used to show how current predictors suffer from too many false alarms at large node counts. Further, we propose a new failure predictor that chains several machine learning classifiers to make predictions with minimal false alarms. We aim for extremely low false positive rates to guarantee that no unnecessary checkpoints will be performed even for very large node counts. Our experiments, based on real system traces from a large production cluster, show that our predictor achieves a lead-up time of four minutes, a recall of 0.7302, a false positive rate of 0.0004, a precision of 0.9944, and a probability of unnecessary checkpoints (UC) of 0.00011 for 1024 nodes.
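The abstract does not state how UC is computed, so the following is only a minimal sketch of the general scaling argument, not the paper's metric. It assumes that each node raises a false alarm independently with the same probability per prediction window, and that a chain of classifiers raises an alarm only when every stage fires with independent errors. The function names and all numeric inputs are hypothetical and are not taken from the paper's results.

```python
import math
from typing import Sequence


def prob_any_false_alarm(per_node_rate: float, nodes: int) -> float:
    """Probability that at least one of `nodes` nodes raises a false alarm.

    Assumption (not from the paper): nodes err independently, each with
    probability `per_node_rate` in a given prediction window.
    """
    if not 0.0 <= per_node_rate <= 1.0:
        raise ValueError("per_node_rate must lie in [0, 1]")
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    return 1.0 - (1.0 - per_node_rate) ** nodes


def chained_false_positive_rate(stage_rates: Sequence[float]) -> float:
    """False positive rate of a chain that alarms only if every stage fires.

    Assumption (not from the paper): the stages make independent errors,
    so their individual false positive rates multiply.
    """
    for rate in stage_rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("each stage rate must lie in [0, 1]")
    return math.prod(stage_rates)


if __name__ == "__main__":
    # Hypothetical per-stage rates, chosen only for illustration.
    single_stage = 0.01
    chained = chained_false_positive_rate([0.01, 0.05, 0.1])
    for n in (1, 64, 1024):
        print(n, prob_any_false_alarm(single_stage, n),
              prob_any_false_alarm(chained, n))
```

Under these simplifying assumptions, the chance of at least one false alarm grows quickly with node count for a single classifier, while a chain with a much smaller combined rate keeps it low. This is the intuition behind aiming for extremely low false positive rates at scale.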

Original language: English
Title: Proceedings - 26th IEEE International Conference on High Performance Computing, HiPC 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 323-332
Number of pages: 10
ISBN (electronic): 9781728145358
DOIs
Publication status: Published - Dec 2019
Externally published: Yes
Event: 26th Annual IEEE International Conference on High Performance Computing, HiPC 2019 - Hyderabad, India
Duration: 17 Dec 2019 to 20 Dec 2019

Publication series

Name: Proceedings - 26th IEEE International Conference on High Performance Computing, HiPC 2019

Conference

Conference: 26th Annual IEEE International Conference on High Performance Computing, HiPC 2019
Country/Territory: India
City: Hyderabad
Period: 17/12/19 to 20/12/19
