Reinforcement Learning-Driven Co-Scheduling and Diverse Resource Assignments on NUMA Systems

Urvij Saroliya, Eishi Arima, Dai Liu, Martin Schulz

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Because modern HPC systems are typically composed of fat, resource-rich compute nodes, it is usually difficult to fully utilize all node resources with a single application. Co-scheduling, i.e., co-executing multiple complementary applications (or jobs) on the same node in a space-sharing manner, is a promising solution and has been widely studied over the past decade. One major drawback of co-scheduling is that it induces interference among co-located applications due to contention for shared resources; in response, the industry has started to support several resource/traffic partitioning features, e.g., in shared caches or memory controllers, on modern commercial processors. Recent studies have proposed effective approaches to exploiting these advanced features; however, the interactions between these features and (1) job scheduling decisions and (2) NUMA (Non-Uniform Memory Access) effects have generally been overlooked. This paper explicitly targets these two missing pieces and comprehensively coordinates the following decisions using reinforcement learning: (a) job selection for co-execution from a given job queue; and (b) diverse resource assignments to co-executed jobs, leveraging emerging hardware partitioning features while taking NUMA effects into account. Our evaluation demonstrates that our approach can improve total system throughput by up to 78.1% over naive time-sharing-based scheduling.

Original language: English
Title of host publication: Proceedings - 2024 IEEE 42nd International Conference on Computer Design, ICCD 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 170-178
Number of pages: 9
ISBN (Electronic): 9798350380408
State: Published - 2024
Event: 42nd IEEE International Conference on Computer Design, ICCD 2024 - Milan, Italy
Duration: 18 Nov 2024 to 20 Nov 2024

Publication series

Name: Proceedings - IEEE International Conference on Computer Design: VLSI in Computers and Processors
ISSN (Print): 1063-6404

Conference

Conference: 42nd IEEE International Conference on Computer Design, ICCD 2024
Country/Territory: Italy
City: Milan
Period: 18/11/24 to 20/11/24

Keywords

  • Co-Scheduling
  • NUMA Systems
  • Reinforcement Learning
  • Resource Management
