TY - JOUR
T1 - End-to-End Deployment of Winograd-Based DNNs on Edge GPU
AU - Mori, Pierpaolo
AU - Rahman, Mohammad Shanur
AU - Frickenstein, Lukas
AU - Sampath, Shambhavi Balamuthu
AU - Thoma, Moritz
AU - Fasfous, Nael
AU - Vemparala, Manoj Rohit
AU - Frickenstein, Alexander
AU - Stechele, Walter
AU - Passerone, Claudio
N1 - Publisher Copyright:
© 2024 by the authors.
PY - 2024/11
Y1 - 2024/11
N2 - The Winograd algorithm reduces the computational complexity of convolutional neural networks (CNNs) by minimizing the number of multiplications required for convolutions, making it particularly suitable for resource-constrained edge devices. Concurrently, most edge hardware accelerators utilize 8-bit integer arithmetic to enhance energy efficiency and reduce inference latency, requiring the quantization of CNNs before deployment. Combining Winograd-based convolution with quantization offers the potential for both performance acceleration and reduced energy consumption. However, prior research has identified significant challenges in this combination, particularly due to numerical instability and substantial accuracy degradation caused by the transformations required in the Winograd domain, making the two techniques incompatible on edge hardware. In this work, we describe our latest training scheme, which addresses these challenges, enabling the successful integration of Winograd-accelerated convolution with low-precision quantization while maintaining high task-related accuracy. Our approach mitigates the numerical instability typically introduced during the transformation, ensuring compatibility between the two techniques. Additionally, we extend our work by presenting a custom-optimized CUDA implementation of quantized Winograd convolution for NVIDIA edge GPUs. This implementation takes full advantage of the proposed training scheme, achieving both high computational efficiency and accuracy, making it a compelling solution for edge-based AI applications. Our training approach enables significant MAC reduction with minimal impact on prediction quality. Furthermore, our hardware results demonstrate up to a 3.4× latency reduction for specific layers, and a 1.44× overall reduction in latency for the entire DeepLabV3 model, compared to the standard implementation.
KW - CNN
KW - GPU
KW - hardware accelerator
KW - NVIDIA
KW - quantization
KW - Winograd convolution
UR - http://www.scopus.com/inward/record.url?scp=85210236563&partnerID=8YFLogxK
U2 - 10.3390/electronics13224538
DO - 10.3390/electronics13224538
M3 - Article
AN - SCOPUS:85210236563
SN - 2079-9292
VL - 13
JO - Electronics (Switzerland)
JF - Electronics (Switzerland)
IS - 22
M1 - 4538
ER -