TY - GEN
T1 - Mechanisms and evaluation of cross-layer fault-tolerance for supercomputing
AU - Ho, Chen Han
AU - De Kruijf, Marc
AU - Sankaralingam, Karthikeyan
AU - Rountree, Barry
AU - Schulz, Martin
AU - De Supinski, Bronis R.
PY - 2012
Y1 - 2012
N2 - Reliability is emerging as an important constraint for future microprocessors. Cooperative hardware and software approaches for error tolerance can solve this hardware reliability challenge. Cross-layer fault tolerance frameworks expose hardware failures to upper layers, such as the compiler, to help correct faults. Such cooperative approaches require less hardware complexity than masking all faults at the hardware level and are generally more energy efficient. This paper provides a detailed design and an implementation study of cross-layer fault tolerance for supercomputing. Since supercomputers necessarily involve large component counts, they have more frequent failures than consumer electronics and small systems. Conventionally, these systems use redundancy and checkpointing to achieve reliable computing. However, redundancy increases acquisition as well as recurring energy costs. This paper describes a simple language-level mechanism coupled with complementary compilation and lightweight hardware error detection that provides efficient reliability and cross-layer fault tolerance for supercomputers. Our evaluation focuses on strong-scaling problems for which we can trade computing power for redundancy. Our results show a range of 1.07x to 2.5x speedup when employing cross-layer error tolerance compared to conventional full dual modular redundancy (DMR) to contain all errors within hardware. Further, we demonstrate the approach can sustain 7% to 50% lower energy. The most important result of this work is qualitative: we can use a simplified hardware design with relaxed architectural correctness guarantees.
AB - Reliability is emerging as an important constraint for future microprocessors. Cooperative hardware and software approaches for error tolerance can solve this hardware reliability challenge. Cross-layer fault tolerance frameworks expose hardware failures to upper layers, such as the compiler, to help correct faults. Such cooperative approaches require less hardware complexity than masking all faults at the hardware level and are generally more energy efficient. This paper provides a detailed design and an implementation study of cross-layer fault tolerance for supercomputing. Since supercomputers necessarily involve large component counts, they have more frequent failures than consumer electronics and small systems. Conventionally, these systems use redundancy and checkpointing to achieve reliable computing. However, redundancy increases acquisition as well as recurring energy costs. This paper describes a simple language-level mechanism coupled with complementary compilation and lightweight hardware error detection that provides efficient reliability and cross-layer fault tolerance for supercomputers. Our evaluation focuses on strong-scaling problems for which we can trade computing power for redundancy. Our results show a range of 1.07x to 2.5x speedup when employing cross-layer error tolerance compared to conventional full dual modular redundancy (DMR) to contain all errors within hardware. Further, we demonstrate the approach can sustain 7% to 50% lower energy. The most important result of this work is qualitative: we can use a simplified hardware design with relaxed architectural correctness guarantees.
KW - Cross-Layer Fault Tolerance
KW - HPC
KW - Reliability
KW - Supercomputing
UR - http://www.scopus.com/inward/record.url?scp=84871176503&partnerID=8YFLogxK
U2 - 10.1109/ICPP.2012.37
DO - 10.1109/ICPP.2012.37
M3 - Conference contribution
AN - SCOPUS:84871176503
SN - 9780769547961
T3 - Proceedings of the International Conference on Parallel Processing
SP - 510
EP - 519
BT - Proceedings - 41st International Conference on Parallel Processing, ICPP 2012
T2 - 41st International Conference on Parallel Processing, ICPP 2012
Y2 - 10 September 2012 through 13 September 2012
ER -