TY - JOUR
T1 - Targeting DNN Inference Via Efficient Utilization of Heterogeneous Precision DNN Accelerators
AU - Spantidi, Ourania
AU - Zervakis, Georgios
AU - Alsalamin, Sami
AU - Roman-Ballesteros, Isai
AU - Henkel, Jörg
AU - Amrouch, Hussam
AU - Anagnostopoulos, Iraklis
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2023/1/1
Y1 - 2023/1/1
N2 - Modern applications increasingly rely on the simultaneous execution of multiple DNNs, and Heterogeneous DNN Accelerators (HDAs) prevail as a solution to this trend. In this work, we propose, implement, and evaluate low-precision Neural Processing Units (NPUs) that serve as building blocks for constructing HDAs, to address the efficient deployment of multi-DNN workloads. Moreover, we design and evaluate HDA designs that increase overall throughput while reducing energy consumption during NN inference. At design time, we implement HDAs inspired by the big.LITTLE computing paradigm, consisting of 8-bit NPUs together with lower-precision NPUs. Additionally, an NN-to-NPU scheduling methodology is implemented to decide at run-time how to map each executed NN to a suitable NPU based on an accuracy-drop threshold value. Our hardware/software co-design reduces the energy and response time of NNs by 29% and 10%, respectively, when compared to state-of-the-art homogeneous architectures. This comes with a negligible accuracy drop of merely 0.5%. Similar to the traditional CPU big.LITTLE, our asymmetric NPU designs can open new doors for novel DNN accelerator architectures, due to their profound role in increasing the efficiency of DNNs with minimal losses in accuracy.
KW - Approximate computing
KW - deep neural networks
KW - hardware-software co-design
KW - heterogeneous (approximate) accelerators
KW - low-power
KW - systolic MAC array
UR - http://www.scopus.com/inward/record.url?scp=85131721597&partnerID=8YFLogxK
U2 - 10.1109/TETC.2022.3178730
DO - 10.1109/TETC.2022.3178730
M3 - Article
AN - SCOPUS:85131721597
SN - 2168-6750
VL - 11
SP - 112
EP - 125
JO - IEEE Transactions on Emerging Topics in Computing
JF - IEEE Transactions on Emerging Topics in Computing
IS - 1
ER -