TY - GEN
T1 - A study of application-level recovery methods for transient network faults
AU - Laguna, Ignacio
AU - León, Edgar A.
AU - Schulz, Martin
AU - Stephenson, Mark
PY - 2013
Y1 - 2013
N2 - With the increasing number of components in HPC systems, transient faults will become commonplace. Today, network transient faults, such as lost or corrupted network packets, are addressed by middleware libraries at the cost of high memory usage and packet retransmissions. These costs, however, can be eliminated using application-level fault tolerance. In this paper, we propose recovery methods for transient network faults at the application level. These methods reconstruct missing or corrupted data via interpolation. We derive a realistic fault model using network fault rates from a production HPC cluster and use it to demonstrate the effectiveness of our reconstruction methods in an FFT kernel. We found that the normalized root-mean-square error for FFT computations can be as low as 0.1% and, thus, demonstrates that network faults can be handled at the application level with low perturbation in applications that have smoothness in their computed data.
AB - With the increasing number of components in HPC systems, transient faults will become commonplace. Today, network transient faults, such as lost or corrupted network packets, are addressed by middleware libraries at the cost of high memory usage and packet retransmissions. These costs, however, can be eliminated using application-level fault tolerance. In this paper, we propose recovery methods for transient network faults at the application level. These methods reconstruct missing or corrupted data via interpolation. We derive a realistic fault model using network fault rates from a production HPC cluster and use it to demonstrate the effectiveness of our reconstruction methods in an FFT kernel. We found that the normalized root-mean-square error for FFT computations can be as low as 0.1% and, thus, demonstrates that network faults can be handled at the application level with low perturbation in applications that have smoothness in their computed data.
KW - Application-level fault recovery
KW - Network faults
KW - Resilience
UR - http://www.scopus.com/inward/record.url?scp=84892910477&partnerID=8YFLogxK
U2 - 10.1145/2530268.2530271
DO - 10.1145/2530268.2530271
M3 - Conference contribution
AN - SCOPUS:84892910477
SN - 9781450325080
T3 - Proc. of ScalA 2013: Workshop on Latest Adv. in Scalable Algorithms for Large-Scale Systems - Held in Conjunction with SC 2013: The Int. Conf. for High Perform. Comput., Networking, Storage and Anal.
BT - Proc. of ScalA 2013
T2 - Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2013 - Held in Conjunction with the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2013
Y2 - 17 November 2013 through 21 November 2013
ER -