TY - GEN
T1 - Reinforcement Learning-Driven Co-Scheduling and Diverse Resource Assignments on NUMA Systems
AU - Saroliya, Urvij
AU - Arima, Eishi
AU - Liu, Dai
AU - Schulz, Martin
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - As modern HPC systems are typically composed of fat and rich compute nodes, it is usually difficult to fully utilize all node resources with a single application. Co-scheduling, i.e., co-executing multiple complementary applications (or jobs) on the same node in a space-sharing manner, is a promising solution and thus has been widely studied in the past decade. Because one major drawback of co-scheduling is the interference it induces among co-located applications through contention for shared resources, the industry has started to support several resource/traffic partitioning features, e.g., in shared caches or memory controllers, on modern commercial processors. Recent studies proposed effective approaches to make use of these advanced features; however, the interactions between these features and (1) job scheduling decisions as well as (2) NUMA (Non-Uniform Memory Access) effects were generally overlooked. This paper explicitly targets these two missing pieces and comprehensively harmonizes the following decisions using reinforcement learning: (a) job selections for co-execution from a given job queue; and (b) diverse resource assignments to co-executed jobs, leveraging emerging hardware partitioning features, while taking NUMA-awareness into account. Our evaluation results demonstrate that our approach can improve the total system throughput by up to 78.1% over naive time-sharing-based scheduling.
KW - Co-Scheduling
KW - NUMA Systems
KW - Reinforcement Learning
KW - Resource Management
UR - http://www.scopus.com/inward/record.url?scp=85217057144&partnerID=8YFLogxK
U2 - 10.1109/ICCD63220.2024.00034
DO - 10.1109/ICCD63220.2024.00034
M3 - Conference contribution
AN - SCOPUS:85217057144
T3 - Proceedings - IEEE International Conference on Computer Design: VLSI in Computers and Processors
SP - 170
EP - 178
BT - Proceedings - 2024 IEEE 42nd International Conference on Computer Design, ICCD 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 42nd IEEE International Conference on Computer Design, ICCD 2024
Y2 - 18 November 2024 through 20 November 2024
ER -