Predicting faults in high performance computing systems: An in-depth survey of the state-of-the-practice

David Jauk, Dai Yang, Martin Schulz

Publication: Contribution to book/report/conference proceedings › Conference contribution › Peer-reviewed

24 citations (Scopus)

Abstract

As we near exascale, resilience remains a major technical hurdle. Any technique with the goal of achieving resilience suffers from having to be reactive, as failures can appear at any time. A wide body of research aims at predicting failures, i.e., forecasting failures so that evasive actions can be taken while the system is still fully functional, which has the benefit of giving insight into the global system state. This research area has grown very diverse with a large number of approaches, yet is currently poorly classified, making it hard to understand the impact and coverage of existing work. In this paper, we perform an extensive survey of existing literature in failure prediction by analyzing and comparing more than 30 different failure prediction approaches. We develop a taxonomy, which aids in categorizing the methods, and we show how this can help us to understand the state-of-the-practice of this field and to identify opportunities, gaps as well as future work.

Original language: English
Title: Proceedings of SC 2019
Subtitle: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: IEEE Computer Society
ISBN (electronic): 9781450362290
DOIs
Publication status: Published - 17 Nov 2019
Event: 2019 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2019 - Denver, United States
Duration: 17 Nov 2019 to 22 Nov 2019

Publication series

Name: International Conference for High Performance Computing, Networking, Storage and Analysis, SC
ISSN (print): 2167-4329
ISSN (electronic): 2167-4337

Conference

Conference: 2019 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2019
Country/Territory: United States
City: Denver
Period: 17/11/19 to 22/11/19
