TY - GEN
T1 - Identifying the Culprits behind Network Congestion
AU - Bhatele, Abhinav
AU - Titus, Andrew R.
AU - Thiagarajan, Jayaraman J.
AU - Jain, Nikhil
AU - Gamblin, Todd
AU - Bremer, Peer Timo
AU - Schulz, Martin
AU - Kale, Laxmikant V.
N1 - Publisher Copyright: © 2015 IEEE.
PY - 2015/7/17
Y1 - 2015/7/17
N2 - Network congestion is one of the primary causes of performance degradation, performance variability and poor scaling in communication-heavy parallel applications. However, the causes and mechanisms of network congestion on modern interconnection networks are not well understood. We need new approaches to analyze, model and predict this critical behaviour in order to improve the performance of large-scale parallel applications. This paper applies supervised learning algorithms, such as forests of extremely randomized trees and gradient boosted regression trees, to perform regression analysis on communication data and application execution time. Using data derived from multiple executions, we create models to predict the execution time of communication-heavy parallel applications. This analysis also identifies the features and associated hardware components that have the most impact on network congestion and, in turn, on execution time. The ideas presented in this paper have wide applicability: predicting the execution time on a different number of nodes, or different input datasets, or even for an unknown code, identifying the best configuration parameters for an application, and finding the root causes of network congestion on different architectures.
AB - Network congestion is one of the primary causes of performance degradation, performance variability and poor scaling in communication-heavy parallel applications. However, the causes and mechanisms of network congestion on modern interconnection networks are not well understood. We need new approaches to analyze, model and predict this critical behaviour in order to improve the performance of large-scale parallel applications. This paper applies supervised learning algorithms, such as forests of extremely randomized trees and gradient boosted regression trees, to perform regression analysis on communication data and application execution time. Using data derived from multiple executions, we create models to predict the execution time of communication-heavy parallel applications. This analysis also identifies the features and associated hardware components that have the most impact on network congestion and, in turn, on execution time. The ideas presented in this paper have wide applicability: predicting the execution time on a different number of nodes, or different input datasets, or even for an unknown code, identifying the best configuration parameters for an application, and finding the root causes of network congestion on different architectures.
KW - congestion
KW - interconnection network
KW - machine learning
KW - modeling
KW - performance prediction
KW - root cause
UR - http://www.scopus.com/inward/record.url?scp=84971449350&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2015.92
DO - 10.1109/IPDPS.2015.92
M3 - Conference contribution
AN - SCOPUS:84971449350
T3 - Proceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium, IPDPS 2015
SP - 113
EP - 122
BT - Proceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium, IPDPS 2015
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 29th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2015
Y2 - 25 May 2015 through 29 May 2015
ER -