TY - GEN
T1 - sys-sage
T2 - 38th ACM International Conference on Supercomputing, ICS 2024
AU - Vanecek, Stepan
AU - Schulz, Martin
N1 - Publisher Copyright: © 2024 ACM.
PY - 2024/5/30
Y1 - 2024/5/30
N2 - HPC systems are getting ever more powerful, but this comes at the price of increasing system complexity: node architectures are deeply hierarchical and in many cases heterogeneous, and components can interact with each other in unpredictable ways. Further, current and future systems exhibit increasingly dynamic behavior, making static knowledge of their configuration alone insufficient. To use such systems efficiently, users as well as runtime systems have to be aware of the exact hardware structure at any time, i.e., the system's topology, its configuration parameters, and any side effect a component can have on the rest of the system, and how this changes over time. Current approaches to providing such information usually focus on a single aspect and do not consider dynamic behavior. For example, the widely used hwloc library, the current de facto standard solution for retrieving hardware topology information, provides a static hierarchical view of all node hardware, but covers neither other system configuration aspects nor dynamic behavior; other systems have similar limitations. In this paper, we propose sys-sage, a novel approach that overcomes these limitations and goes beyond the functionality of existing tools, including hwloc. It offers the ability to track dynamic changes, while unifying access to all system topology and configuration data. With that, it provides, at any point in time, a complete and updated view of the HPC system on which an application or runtime system is executing. The novelty of our approach lies in the ability to combine static hardware topology information with other relevant system data in a single API, while enabling a dynamic view and exposing system updates and reconfigurations on the fly. We show the design of sys-sage and demonstrate its applicability based on three separate use cases, as well as by presenting further scenarios not easily solvable with currently available tools.
KW - Hardware Architecture
KW - Heterogeneous Computing
KW - HPC System Topology
KW - Performance Optimizations
UR - http://www.scopus.com/inward/record.url?scp=85196317912&partnerID=8YFLogxK
U2 - 10.1145/3650200.3656627
DO - 10.1145/3650200.3656627
M3 - Conference contribution
AN - SCOPUS:85196317912
T3 - Proceedings of the International Conference on Supercomputing
SP - 363
EP - 375
BT - ICS 2024 - Proceedings of the 38th ACM International Conference on Supercomputing
PB - Association for Computing Machinery
Y2 - 4 June 2024 through 7 June 2024
ER -