Understanding the spatial characteristics of DRAM errors in HPC clusters

Ayush Patwari, Martin Schulz, Ignacio Laguna, Saurabh Bagchi

Publication: Chapter in book/report/conference proceeding › Conference contribution › Peer-reviewed

9 Citations (Scopus)

Abstract

Understanding DRAM errors in high-performance computing (HPC) clusters is paramount to addressing future HPC resilience challenges. While there have been studies on this topic, previous work has focused on on-node and single-rack characteristics of errors; few studies have presented insights into the spatial behavior of DRAM errors across an entire cluster. Understanding the spatial peculiarities of DRAM errors across an entire cluster is crucial for cluster temperature management, job allocation, and failure prediction. In this paper, we study the spatial nature of DRAM errors using data gathered in a large production HPC cluster. Our analysis shows that nodes with a high degree of errors are grouped in spatial regions for periods of time, suggesting that these "susceptible" regions are collectively more vulnerable to errors than other regions. We then use our observations to build a predictor, which identifies such regions given the prior patterns of neighboring regions.

Original language: English
Title of host publication: FTXS 2017 - Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, co-located with HPDC 2017
Publisher: Association for Computing Machinery, Inc
Pages: 17-22
Number of pages: 6
ISBN (electronic): 9781450350013
DOIs
Publication status: Published - 26 June 2017
Externally published: Yes
Event: 7th Fault Tolerance for HPC at eXtreme Scale Workshop, FTXS 2017 - Washington, United States
Duration: 26 June 2017 to 30 June 2017

Publication series

Name: FTXS 2017 - Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, co-located with HPDC 2017

Conference

Conference: 7th Fault Tolerance for HPC at eXtreme Scale Workshop, FTXS 2017
Country/Territory: United States
City: Washington
Period: 26/06/17 to 30/06/17
