TY - GEN
T1 - Characterizing HPC Performance Variation with Monitoring and Unsupervised Learning
AU - Ozer, Gence
AU - Netti, Alessio
AU - Tafani, Daniele
AU - Schulz, Martin
N1 - Publisher Copyright: © 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
AB - As HPC systems grow larger and more complex, characterizing the relationships between their different components and gaining insight on their behavior becomes difficult. In turn, this puts a burden on both system administrators and developers who aim at improving the efficiency and reliability of systems, algorithms and applications. Automated approaches capable of extracting a system’s behavior, as well as identifying anomalies and outliers, are necessary more than ever. In this work we discuss our exploratory study of Bayesian Gaussian mixture models, an unsupervised machine learning technique, to characterize the performance of an HPC system’s components, as well as to identify anomalies, based on sensor data. We propose an algorithmic framework for this purpose, implement it within the DCDB monitoring and operational data analytics system, and present several case studies carried out using data from a production HPC system.
KW - Anomaly detection
KW - Clustering
KW - HPC systems
KW - Monitoring
KW - Operational data analytics
UR - http://www.scopus.com/inward/record.url?scp=85096424245&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-59851-8_18
DO - 10.1007/978-3-030-59851-8_18
M3 - Conference contribution
AN - SCOPUS:85096424245
SN - 9783030598501
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 280
EP - 292
BT - High Performance Computing - ISC High Performance 2020 International Workshops, Revised Selected Papers
A2 - Jagode, Heike
A2 - Anzt, Hartwig
A2 - Juckeland, Guido
A2 - Ltaief, Hatem
PB - Springer Science and Business Media Deutschland GmbH
T2 - 35th International Conference on High Performance Computing, ISC High Performance 2020
Y2 - 21 June 2020 through 25 June 2020
ER -