Identifying the Culprits behind Network Congestion

Abhinav Bhatele, Andrew R. Titus, Jayaraman J. Thiagarajan, Nikhil Jain, Todd Gamblin, Peer Timo Bremer, Martin Schulz, Laxmikant V. Kale

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

37 Scopus citations

Abstract

Network congestion is one of the primary causes of performance degradation, performance variability and poor scaling in communication-heavy parallel applications. However, the causes and mechanisms of network congestion on modern interconnection networks are not well understood. We need new approaches to analyze, model and predict this critical behaviour in order to improve the performance of large-scale parallel applications. This paper applies supervised learning algorithms, such as forests of extremely randomized trees and gradient boosted regression trees, to perform regression analysis on communication data and application execution time. Using data derived from multiple executions, we create models to predict the execution time of communication-heavy parallel applications. This analysis also identifies the features and associated hardware components that have the most impact on network congestion and, in turn, on execution time. The ideas presented in this paper have wide applicability: predicting the execution time on a different number of nodes, or different input datasets, or even for an unknown code, identifying the best configuration parameters for an application, and finding the root causes of network congestion on different architectures.
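The regression workflow described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes scikit-learn's `ExtraTreesRegressor` and `GradientBoostingRegressor` as stand-ins for the two model families named above, the feature names are hypothetical, and the data is synthetic and generated only so that the example runs.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical per-run network features (names are assumptions, not from the paper).
feature_names = [
    "avg_bytes_per_link",
    "max_bytes_per_link",
    "avg_stall_cycles",
    "avg_hops",
]

# Synthetic "runs": each row is one execution, each column one feature.
n_runs = 400
X = rng.uniform(0.0, 1.0, size=(n_runs, len(feature_names)))

# Synthetic execution time, constructed so that stall cycles dominate.
# This relationship is invented for illustration only.
y = 10.0 + 8.0 * X[:, 2] + 3.0 * X[:, 1] + rng.normal(0.0, 0.1, size=n_runs)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "extra_trees": ExtraTreesRegressor(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(n_estimators=200, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # R^2 on held-out runs: how well execution time is predicted.
    r2 = model.score(X_test, y_test)
    # Rank features by the model's impurity-based importance scores.
    ranked = sorted(
        zip(feature_names, model.feature_importances_),
        key=lambda pair: pair[1],
        reverse=True,
    )
    results[name] = (r2, ranked)
    print(f"{name}: R^2 = {r2:.3f}")
    for feature, importance in ranked:
        print(f"  {feature}: {importance:.3f}")
```

On real measurements, the feature ranking is the part that would point to candidate hardware components. Impurity-based importances can be misleading when features are strongly correlated, so they are best read as a starting point for investigation.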

Original language: English
Title of host publication: Proceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium, IPDPS 2015
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 113-122
Number of pages: 10
ISBN (Electronic): 9781479986484
DOIs
State: Published - 17 Jul 2015
Externally published: Yes
Event: 29th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2015 - Hyderabad, India
Duration: 25 May 2015 to 29 May 2015

Publication series

Name: Proceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium, IPDPS 2015

Conference

Conference: 29th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2015
Country/Territory: India
City: Hyderabad
Period: 25/05/15 to 29/05/15

Keywords

  • congestion
  • interconnection network
  • machine learning
  • modeling
  • performance prediction
  • root cause
