Characterizing HPC Performance Variation with Monitoring and Unsupervised Learning

Gence Ozer, Alessio Netti, Daniele Tafani, Martin Schulz

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

6 Scopus citations

Abstract

As HPC systems grow larger and more complex, characterizing the relationships between their components and gaining insight into their behavior becomes difficult. This in turn burdens both system administrators and developers who aim to improve the efficiency and reliability of systems, algorithms and applications. Automated approaches capable of extracting a system’s behavior, as well as identifying anomalies and outliers, are more necessary than ever. In this work we discuss our exploratory study of Bayesian Gaussian mixture models, an unsupervised machine learning technique, to characterize the performance of an HPC system’s components, as well as to identify anomalies, based on sensor data. We propose an algorithmic framework for this purpose, implement it within the DCDB monitoring and operational data analytics system, and present several case studies carried out using data from a production HPC system.

Original language: English
Title of host publication: High Performance Computing - ISC High Performance 2020 International Workshops, Revised Selected Papers
Editors: Heike Jagode, Hartwig Anzt, Guido Juckeland, Hatem Ltaief
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 280-292
Number of pages: 13
ISBN (Print): 9783030598501
State: Published - 2020
Event: 35th International Conference on High Performance Computing, ISC High Performance 2020 - Frankfurt am Main, Germany
Duration: 21 Jun 2020 to 25 Jun 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12321 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 35th International Conference on High Performance Computing, ISC High Performance 2020
Country/Territory: Germany
City: Frankfurt am Main
Period: 21/06/20 to 25/06/20

Keywords

  • Anomaly detection
  • Clustering
  • HPC systems
  • Monitoring
  • Operational data analytics
