TY - JOUR
T1 - Adaptive control in roll-forward recovery for extreme scale multigrid
AU - Huber, Markus
AU - Rüde, Ulrich
AU - Wohlmuth, Barbara
N1 - Publisher Copyright: © The Author(s) 2018.
PY - 2019/9/1
Y1 - 2019/9/1
N2 - With the increasing number of compute components, failures in future exa-scale computer systems are expected to become more frequent. This motivates the study of novel resilience techniques. Here, we extend a recently proposed algorithm-based recovery method for multigrid iterations by introducing an adaptive control. After a fault, the healthy part of the system continues the iterative solution process, while the solution in the faulty domain is reconstructed by an asynchronous online recovery. The computations in both the faulty and the healthy subdomains must be coordinated in a sensitive way, in particular, both under- and over-solving must be avoided. Both of these waste computational resources and will therefore increase the overall time-to-solution. To control the local recovery and guarantee an optimal recoupling, we introduce a stopping criterion based on a mathematical error estimator. It involves hierarchically weighted sums of residuals within the context of uniformly refined meshes and is well-suited in the context of parallel high-performance computing. The recoupling process is steered by local contributions of the error estimator before the fault. Failure scenarios when solving up to 6.9 × 10^11 unknowns on more than 245,766 parallel processes will be reported on a state-of-the-art peta-scale supercomputer demonstrating the robustness of the method.
AB - With the increasing number of compute components, failures in future exa-scale computer systems are expected to become more frequent. This motivates the study of novel resilience techniques. Here, we extend a recently proposed algorithm-based recovery method for multigrid iterations by introducing an adaptive control. After a fault, the healthy part of the system continues the iterative solution process, while the solution in the faulty domain is reconstructed by an asynchronous online recovery. The computations in both the faulty and the healthy subdomains must be coordinated in a sensitive way, in particular, both under- and over-solving must be avoided. Both of these waste computational resources and will therefore increase the overall time-to-solution. To control the local recovery and guarantee an optimal recoupling, we introduce a stopping criterion based on a mathematical error estimator. It involves hierarchically weighted sums of residuals within the context of uniformly refined meshes and is well-suited in the context of parallel high-performance computing. The recoupling process is steered by local contributions of the error estimator before the fault. Failure scenarios when solving up to 6.9 × 10^11 unknowns on more than 245,766 parallel processes will be reported on a state-of-the-art peta-scale supercomputer demonstrating the robustness of the method.
KW - Algorithm-based fault tolerance
KW - adaptive recovery
KW - error estimator
KW - high-performance computing
KW - multigrid methods
UR - http://www.scopus.com/inward/record.url?scp=85059683835&partnerID=8YFLogxK
U2 - 10.1177/1094342018817088
DO - 10.1177/1094342018817088
M3 - Article
AN - SCOPUS:85059683835
SN - 1094-3420
VL - 33
SP - 817
EP - 837
JO - International Journal of High Performance Computing Applications
JF - International Journal of High Performance Computing Applications
IS - 5
ER -