TY - GEN
T1 - Fault resilience of the algebraic multi-grid solver
AU - Casas, Marc
AU - De Supinski, Bronis R.
AU - Bronevetsky, Greg
AU - Schulz, Martin
PY - 2012
Y1 - 2012
AB - As HPC system sizes grow to millions of cores and chip feature sizes continue to decrease, HPC applications become increasingly exposed to transient hardware faults. These faults can cause aborts and performance degradation. Most importantly, they can corrupt results. Thus, we must evaluate the fault vulnerability of key HPC algorithms to develop cost-effective techniques to improve application resilience. We present an approach that analyzes the vulnerability of applications to faults, systematically reduces it by protecting the most vulnerable components, and predicts application vulnerability at large scales. We initially focus on sparse scientific applications and apply our approach in this paper to the Algebraic Multi-Grid (AMG) algorithm. We empirically analyze AMG's vulnerability to hardware faults in both sequential and parallel (hybrid MPI/OpenMP) executions on up to 1,600 cores and propose and evaluate the use of targeted pointer replication to reduce it. Our techniques increase AMG's resilience to transient hardware faults by 50-80% and improve its scalability in faulty computational environments by 35%. Further, we show how to model AMG's scalability in fault-prone environments to predict execution times of large-scale runs accurately.
KW - Algebraic Multi-Grid solver
KW - Resilience
KW - Transient faults
UR - http://www.scopus.com/inward/record.url?scp=84864068316&partnerID=8YFLogxK
U2 - 10.1145/2304576.2304590
DO - 10.1145/2304576.2304590
M3 - Conference contribution
AN - SCOPUS:84864068316
SN - 9781450313162
T3 - Proceedings of the International Conference on Supercomputing
SP - 91
EP - 100
BT - ICS'12 - Proceedings of the 2012 ACM International Conference on Supercomputing
T2 - 26th ACM International Conference on Supercomputing, ICS'12
Y2 - 25 June 2012 through 29 June 2012
ER -