Abstract
Graphic Processing Units (GPUs) play a vital role in state-of-the-art high-performance scientific computing realm and research work towards its performance analysis is crucial but nontrivial. Extant GPU performance models are far from practical use, while fine-grained GPU simulation requires a considerably large time cost. Moreover, massive amounts of designs with various program inputs and parameter settings pose a challenge for efficient performance estimation and tuning of parallel GPU applications. To this end, this article presents a hybrid framework for the efficient performance estimation and work-group size pruning of OpenCL workloads on GPUs. The framework contains a static module used to extract the kernel execution trace from the high-level source code and a dynamical module used to mimic the kernel execution flow to estimate the runtime performance. For the design space pruning, an extra analysis is performed to filter out the redundant work-group sizes with duplicated execution traces and inferior pipelines. The proposed framework does not require any program runs to estimate the performance and find the optimal or near-optimal designs. Experiments on four Commercial Off-The-Shelf (COTS) Nvidia GPUs show that the framework can predict the runtime performance with an average error of 17.04 percent and reduce the program design space by an average of 78.47 percent.
Original language | English |
---|---|
Article number | 8928962 |
Pages (from-to) | 1089-1106 |
Number of pages | 18 |
Journal | IEEE Transactions on Parallel and Distributed Systems |
Volume | 31 |
Issue number | 5 |
DOIs | |
State | Published - 1 May 2020 |
Keywords
- GPU
- OpenCL
- performance estimation
- performance tuning
- work-group size