Understanding the spatial characteristics of DRAM errors in HPC clusters

Ayush Patwari, Martin Schulz, Ignacio Laguna, Saurabh Bagchi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

8 Scopus citations

Abstract

Understanding DRAM errors in high-performance computing (HPC) clusters is paramount to addressing future HPC resilience challenges. While there have been studies on this topic, previous work has focused on on-node and single-rack characteristics of errors; few studies have presented insights into the spatial behavior of DRAM errors across an entire cluster. Understanding the spatial peculiarities of DRAM errors across an entire cluster is crucial for cluster temperature management, job allocation, and failure prediction. In this paper, we study the spatial nature of DRAM errors using data gathered in a large production HPC cluster. Our analysis shows that nodes with a high degree of errors are grouped in spatial regions for periods of time, suggesting that these "susceptible" regions are collectively more vulnerable to errors than other regions. We then use our observations to build a predictor that identifies such regions given prior patterns in neighboring regions.

Original language: English
Title of host publication: FTXS 2017 - Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, co-located with HPDC 2017
Publisher: Association for Computing Machinery, Inc
Pages: 17-22
Number of pages: 6
ISBN (Electronic): 9781450350013
State: Published - 26 Jun 2017
Externally published: Yes
Event: 7th Fault Tolerance for HPC at eXtreme Scale Workshop, FTXS 2017 - Washington, United States
Duration: 26 Jun 2017 → 30 Jun 2017

Publication series

Name: FTXS 2017 - Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, co-located with HPDC 2017

Conference

Conference: 7th Fault Tolerance for HPC at eXtreme Scale Workshop, FTXS 2017
Country/Territory: United States
City: Washington
Period: 26/06/17 → 30/06/17

Keywords

  • Computer systems organization
  • Hardware
  • Reliability
  • Transient errors and upsets
