TY - GEN
T1 - Understanding the spatial characteristics of DRAM errors in HPC clusters
AU - Patwari, Ayush
AU - Schulz, Martin
AU - Laguna, Ignacio
AU - Bagchi, Saurabh
N1 - Publisher Copyright: © 2017 ACM.
PY - 2017/6/26
Y1 - 2017/6/26
AB - Understanding DRAM errors in high-performance computing (HPC) clusters is paramount to addressing future HPC resilience challenges. While there have been studies on this topic, previous work has focused on on-node and single-rack characteristics of errors; few studies have presented insights into the spatial behavior of DRAM errors across an entire cluster. Understanding the spatial peculiarities of DRAM errors across an entire cluster is crucial for cluster temperature management, job allocation, and failure prediction. In this paper, we study the spatial nature of DRAM errors using data gathered in a large production HPC cluster. Our analysis shows that nodes with a high degree of errors are grouped in spatial regions for periods of time, suggesting that these "susceptible" regions are collectively more vulnerable to errors than other regions. We then use our observations to build a predictor, which identifies such regions given the prior error patterns of neighboring regions.
KW - Computer systems organization
KW - Hardware
KW - Reliability
KW - Transient errors and upsets
UR - http://www.scopus.com/inward/record.url?scp=85025823702&partnerID=8YFLogxK
U2 - 10.1145/3086157.3086164
DO - 10.1145/3086157.3086164
M3 - Conference contribution
AN - SCOPUS:85025823702
T3 - FTXS 2017 - Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, co-located with HPDC 2017
SP - 17
EP - 22
BT - FTXS 2017 - Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, co-located with HPDC 2017
PB - Association for Computing Machinery, Inc
T2 - 7th Fault Tolerance for HPC at eXtreme Scale Workshop, FTXS 2017
Y2 - 26 June 2017 through 30 June 2017
ER -