TY - GEN
T1 - Benchmarking Generative AI Models for Deep Learning Test Input Generation
AU - Maryam, Maryam
AU - Biagiola, Matteo
AU - Stocco, Andrea
AU - Riccio, Vincenzo
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Test Input Generators (TIGs) are crucial to assess the ability of Deep Learning (DL) image classifiers to provide correct predictions for inputs beyond their training and test sets. Recent advancements in Generative AI (GenAI) models have made them a powerful tool for creating and manipulating synthetic images, although these advancements also imply increased complexity and resource demands for training. In this work, we benchmark and combine different GenAI models with TIGs, assessing their effectiveness, efficiency, and the quality of the generated test images in terms of domain validity and label preservation. We conduct an empirical study involving three different GenAI architectures (VAEs, GANs, Diffusion Models), five classification tasks of increasing complexity, and 364 human evaluations. Our results show that simpler architectures, such as VAEs, are sufficient for less complex datasets like MNIST. However, when dealing with feature-rich datasets, such as ImageNet, more sophisticated architectures like Diffusion Models achieve superior performance by generating a higher number of valid, misclassification-inducing inputs.
AB - Test Input Generators (TIGs) are crucial to assess the ability of Deep Learning (DL) image classifiers to provide correct predictions for inputs beyond their training and test sets. Recent advancements in Generative AI (GenAI) models have made them a powerful tool for creating and manipulating synthetic images, although these advancements also imply increased complexity and resource demands for training. In this work, we benchmark and combine different GenAI models with TIGs, assessing their effectiveness, efficiency, and the quality of the generated test images in terms of domain validity and label preservation. We conduct an empirical study involving three different GenAI architectures (VAEs, GANs, Diffusion Models), five classification tasks of increasing complexity, and 364 human evaluations. Our results show that simpler architectures, such as VAEs, are sufficient for less complex datasets like MNIST. However, when dealing with feature-rich datasets, such as ImageNet, more sophisticated architectures like Diffusion Models achieve superior performance by generating a higher number of valid, misclassification-inducing inputs.
KW - Deep Learning
KW - Generative AI
KW - Software Testing
UR - http://www.scopus.com/inward/record.url?scp=105007557432&partnerID=8YFLogxK
U2 - 10.1109/ICST62969.2025.10989043
DO - 10.1109/ICST62969.2025.10989043
M3 - Conference contribution
AN - SCOPUS:105007557432
T3 - 2025 IEEE Conference on Software Testing, Verification and Validation, ICST 2025
SP - 174
EP - 185
BT - 2025 IEEE Conference on Software Testing, Verification and Validation, ICST 2025
A2 - Fasolino, Anna Rita
A2 - Panichella, Sebastiano
A2 - Aleti, Aldeida
A2 - Mesbah, Ali
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 18th IEEE Conference on Software Testing, Verification and Validation, ICST 2025
Y2 - 31 March 2025 through 4 April 2025
ER -