Characterizing HPC Performance Variation with Monitoring and Unsupervised Learning

Gence Ozer, Alessio Netti, Daniele Tafani, Martin Schulz

Publication: Chapter in book/report/conference proceedings, conference contribution, peer-reviewed

4 citations (Scopus)

Abstract

As HPC systems grow larger and more complex, characterizing the relationships between their different components and gaining insight into their behavior becomes difficult. In turn, this puts a burden on both system administrators and developers who aim to improve the efficiency and reliability of systems, algorithms and applications. Automated approaches capable of extracting a system’s behavior, as well as identifying anomalies and outliers, are more necessary than ever. In this work we discuss our exploratory study of Bayesian Gaussian mixture models, an unsupervised machine learning technique, to characterize the performance of an HPC system’s components, as well as to identify anomalies, based on sensor data. We propose an algorithmic framework for this purpose, implement it within the DCDB monitoring and operational data analytics system, and present several case studies carried out using data from a production HPC system.
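To make the technique named in the abstract concrete, the following is a minimal sketch of fitting a Bayesian Gaussian mixture model to sensor readings and flagging the least likely samples as candidate anomalies. It is not the framework from the paper and does not use DCDB. It relies on scikit-learn's `BayesianGaussianMixture`, and the helper `flag_low_likelihood`, the synthetic (power, temperature) readings, the component cap and the 1% likelihood cutoff are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture


def flag_low_likelihood(samples, max_components=5, quantile=0.01, seed=0):
    """Fit a Bayesian GMM and flag the least likely samples.

    Hypothetical helper for illustration only; the cutoff rule is an
    assumption, not the method described in the paper.
    """
    model = BayesianGaussianMixture(
        n_components=max_components,  # upper bound; unused components shrink
        weight_concentration_prior_type="dirichlet_process",
        covariance_type="full",
        max_iter=500,
        random_state=seed,
    )
    model.fit(samples)
    # Per-sample log-likelihood under the fitted mixture.
    scores = model.score_samples(samples)
    # Flag everything at or below the chosen likelihood quantile.
    threshold = np.quantile(scores, quantile)
    flags = scores <= threshold
    return model, scores, flags


# Synthetic stand-in for node sensor data: (power in W, temperature in C).
rng = np.random.default_rng(0)
idle = rng.normal(loc=[100.0, 50.0], scale=2.0, size=(200, 2))
busy = rng.normal(loc=[200.0, 70.0], scale=2.0, size=(200, 2))
outliers = np.array([[150.0, 95.0], [40.0, 75.0], [260.0, 40.0]])
samples = np.vstack([idle, busy, outliers])

model, scores, flags = flag_low_likelihood(samples)
print("component weights:", np.round(model.weights_, 3))
print("flagged samples:")
print(samples[flags])
```

In this sketch the mixture weights give a rough picture of how many operating regimes the data supports, while the log-likelihood scores rank samples by how well any regime explains them. A fixed quantile always flags some samples, so in practice the cutoff would need to be tuned per sensor.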

Original language: English
Title: High Performance Computing - ISC High Performance 2020 International Workshops, Revised Selected Papers
Editors: Heike Jagode, Hartwig Anzt, Guido Juckeland, Hatem Ltaief
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 280-292
Number of pages: 13
ISBN (Print): 9783030598501
DOIs
Publication status: Published - 2020
Event: 35th International Conference on High Performance Computing, ISC High Performance 2020 - Frankfurt am Main, Germany
Duration: 21 June 2020 to 25 June 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12321 LNCS
ISSN (Print): 0302-9743
ISSN (electronic): 1611-3349

Conference

Conference: 35th International Conference on High Performance Computing, ISC High Performance 2020
Country/Territory: Germany
City: Frankfurt am Main
Period: 21/06/20 to 25/06/20
