TY - CONF
T1 - Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines
AU - Isenko, Alexander
AU - Mayer, Ruben
AU - Jedele, Jeffrey
AU - Jacobsen, Hans-Arno
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/6/10
Y1 - 2022/6/10
AB - Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy. Maximizing resource utilization is becoming more challenging as the throughput of training processes increases with hardware innovations (e.g., faster GPUs, TPUs, and inter-connects) and advanced parallelization techniques that yield better scalability. At the same time, the amount of training data needed in order to train increasingly complex models is growing. As a consequence of this development, data preprocessing and provisioning are becoming a severe bottleneck in end-to-end deep learning pipelines. In this paper, we provide an in-depth analysis of data preprocessing pipelines from four different machine learning domains. We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines and extract individual trade-offs to optimize throughput, preprocessing time, and storage consumption. Additionally, we provide an open-source profiling library that can automatically decide on a suitable preprocessing strategy to maximize throughput. By applying our generated insights to real-world use-cases, we obtain an increased throughput of 3x to 13x compared to an untuned system while keeping the pipeline functionally identical. These findings show the enormous potential of data pipeline tuning.
KW - data processing
KW - datasets
KW - deep learning
KW - machine learning
KW - preprocessing
UR - http://www.scopus.com/inward/record.url?scp=85128548093&partnerID=8YFLogxK
U2 - 10.1145/3514221.3517848
DO - 10.1145/3514221.3517848
M3 - Conference contribution
AN - SCOPUS:85128548093
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 1825
EP - 1839
BT - SIGMOD 2022 - Proceedings of the 2022 International Conference on Management of Data
PB - Association for Computing Machinery
T2 - 2022 ACM SIGMOD International Conference on Management of Data, SIGMOD 2022
Y2 - 12 June 2022 through 17 June 2022
ER -