Clustering performance data efficiently at massive scales

Todd Gamblin, Bronis R. De Supinski, Martin Schulz, Rob Fowler, Daniel A. Reed

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

24 Scopus citations

Abstract

Existing supercomputers have hundreds of thousands of processor cores, and future systems may have hundreds of millions. Developers need detailed performance measurements to tune their applications and to exploit these systems fully. However, extreme scales pose unique challenges for performance-tuning tools, which can generate significant volumes of I/O. Compute-to-I/O ratios have increased drastically as systems have grown, and the I/O systems of large machines can handle the peak load from only a small fraction of cores. Tool developers need efficient techniques to analyze and to reduce performance data from large numbers of cores. We introduce CAPEK, a novel parallel clustering algorithm that enables in-situ analysis of performance data at run time. Our algorithm scales sub-linearly to 131,072 processes, running in less than one second even at that scale, which is fast enough for on-line use in production runs. The CAPEK implementation is fully generic and can be used for many types of analysis. We demonstrate its application to statistical trace sampling. Specifically, we use our algorithm to compute stratified sampling strategies for traces efficiently at run time. We show that such stratification can reduce data volume by up to four orders of magnitude on current large-scale systems, with potential for greater reductions on future extreme-scale systems.
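The stratified trace sampling mentioned in the abstract can be illustrated with a small sketch. This is not CAPEK and not the authors' implementation: it assumes that some clustering step has already assigned a cluster id to every process rank, and it only shows how such labels could drive the choice of which ranks to trace. The function name `stratified_sample`, the parameter `keep_one_in`, and the toy label data are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(labels, keep_one_in, seed=0):
    """Pick ranks to trace, keeping roughly 1 in `keep_one_in` per cluster.

    labels[i] is the (assumed, precomputed) cluster id of process rank i.
    At least one rank is kept per cluster so no behavior class is lost.
    """
    if keep_one_in < 1:
        raise ValueError("keep_one_in must be >= 1")
    rng = random.Random(seed)

    # Group ranks into strata by cluster id.
    strata = defaultdict(list)
    for rank, label in enumerate(labels):
        strata[label].append(rank)

    chosen = []
    for label in sorted(strata):
        ranks = strata[label]
        # Integer ceiling division; never fewer than one rank per stratum.
        k = max(1, -(-len(ranks) // keep_one_in))
        chosen.extend(rng.sample(ranks, k))
    return sorted(chosen)


# Toy input: 1,000 ranks in three behavior classes of very different sizes.
labels = [0] * 900 + [1] * 90 + [2] * 10
traced = stratified_sample(labels, keep_one_in=100)
print(len(traced))  # 11 ranks traced instead of 1,000
```

In this toy case, uniform sampling of 11 ranks could easily miss the 10-rank cluster entirely, whereas the stratified draw is guaranteed to include at least one rank from each cluster.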

Original language: English
Title of host publication: ICS'10 - 2010 International Conference on Supercomputing
Pages: 243-252
Number of pages: 10
State: Published - 2010
Externally published: Yes
Event: 24th ACM International Conference on Supercomputing, ICS'10 - Tsukuba, Ibaraki, Japan
Duration: 2 Jun 2010 → 4 Jun 2010

Publication series

Name: Proceedings of the International Conference on Supercomputing

Conference

Conference: 24th ACM International Conference on Supercomputing, ICS'10
Country/Territory: Japan
City: Tsukuba, Ibaraki
Period: 2/06/10 → 4/06/10
