TY - GEN
T1 - Reconciling QoS and Concurrency in NVIDIA GPUs via Warp-Level Scheduling
AU - Singh, Jayati
AU - Olmedo, Ignacio Sanudo
AU - Capodieci, Nicola
AU - Marongiu, Andrea
AU - Caccamo, Marco
N1 - Publisher Copyright:
© 2022 EDAA.
PY - 2022
Y1 - 2022
N2 - The widespread deployment of NVIDIA GPUs in latency-sensitive systems today requires predictable GPU multi-tasking, which cannot be trivially achieved. The NVIDIA CUDA API allows programmers to easily exploit the processing power provided by these massively parallel accelerators and is one of the major reasons behind their ubiquity. However, NVIDIA GPUs and the CUDA programming model favor throughput instead of latency and timing predictability. Hence, providing real-time and quality-of-service (QoS) properties to GPU applications presents an interesting research challenge. Such a challenge is paramount when considering simultaneous multikernel (SMK) scenarios, wherein kernels are executed concurrently within each streaming multiprocessor (SM). In this work, we explore QoS-based fine-grained multitasking in SMK via job arbitration at the lowest level of the GPU scheduling hierarchy, i.e., between warps. We present QoS-aware warp scheduling (QAWS) and evaluate it against state-of-the-art, kernel-agnostic policies seen in NVIDIA hardware today. Since the NVIDIA ecosystem lacks a mechanism to specify and enforce kernel priority at the warp granularity, we implement and evaluate our proposed warp scheduling policy on GPGPU-Sim. QAWS not only improves the response time of the higher priority tasks but also has comparable or better throughput than the state-of-the-art policies.
AB - The widespread deployment of NVIDIA GPUs in latency-sensitive systems today requires predictable GPU multi-tasking, which cannot be trivially achieved. The NVIDIA CUDA API allows programmers to easily exploit the processing power provided by these massively parallel accelerators and is one of the major reasons behind their ubiquity. However, NVIDIA GPUs and the CUDA programming model favor throughput instead of latency and timing predictability. Hence, providing real-time and quality-of-service (QoS) properties to GPU applications presents an interesting research challenge. Such a challenge is paramount when considering simultaneous multikernel (SMK) scenarios, wherein kernels are executed concurrently within each streaming multiprocessor (SM). In this work, we explore QoS-based fine-grained multitasking in SMK via job arbitration at the lowest level of the GPU scheduling hierarchy, i.e., between warps. We present QoS-aware warp scheduling (QAWS) and evaluate it against state-of-the-art, kernel-agnostic policies seen in NVIDIA hardware today. Since the NVIDIA ecosystem lacks a mechanism to specify and enforce kernel priority at the warp granularity, we implement and evaluate our proposed warp scheduling policy on GPGPU-Sim. QAWS not only improves the response time of the higher priority tasks but also has comparable or better throughput than the state-of-the-art policies.
UR - http://www.scopus.com/inward/record.url?scp=85130828719&partnerID=8YFLogxK
U2 - 10.23919/DATE54114.2022.9774761
DO - 10.23919/DATE54114.2022.9774761
M3 - Conference contribution
AN - SCOPUS:85130828719
T3 - Proceedings of the 2022 Design, Automation and Test in Europe Conference and Exhibition, DATE 2022
SP - 1275
EP - 1280
BT - Proceedings of the 2022 Design, Automation and Test in Europe Conference and Exhibition, DATE 2022
A2 - Bolchini, Cristiana
A2 - Verbauwhede, Ingrid
A2 - Vatajelu, Ioana
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2022 Design, Automation and Test in Europe Conference and Exhibition, DATE 2022
Y2 - 14 March 2022 through 23 March 2022
ER -