TY - JOUR
T1 - VariViT: A Vision Transformer for Variable Image Sizes
T2 - 7th International Conference on Medical Imaging with Deep Learning, MIDL 2024
AU - Varma, Aswathi
AU - Shit, Suprosanna
AU - Prabhakar, Chinmay
AU - Scholz, Daniel
AU - Li, Hongwei Bran
AU - Menze, Bjoern
AU - Rueckert, Daniel
AU - Wiestler, Benedikt
N1 - Publisher Copyright:
© 2024 CC-BY 4.0, A. Varma, S. Shit, C. Prabhakar, D. Scholz, H.B. Li, B. Menze, D. Rueckert & B. Wiestler.
PY - 2024
Y1 - 2024
N2 - Vision Transformers (ViTs) have emerged as the state-of-the-art architecture in representation learning, leveraging self-attention mechanisms to excel in various tasks. ViTs split images into fixed-size patches, constraining them to a predefined size and necessitating pre-processing steps like resizing, padding, or cropping. This poses challenges in medical imaging, particularly with irregularly shaped structures like tumors. A fixed bounding box crop size produces input images with highly variable foreground-to-background ratios. Resizing medical images can degrade information and introduce artefacts, impacting diagnosis. Hence, tailoring variable-sized crops to regions of interest can enhance feature representation capabilities. Moreover, large images are computationally expensive, and smaller sizes risk information loss, presenting a computation-accuracy tradeoff. We propose VariViT, an improved ViT model crafted to handle variable image sizes while maintaining a consistent patch size. VariViT employs a novel positional embedding resizing scheme for a variable number of patches. We also implement a new batching strategy within VariViT to reduce computational complexity, resulting in faster training and inference times. In our evaluations on two 3D brain MRI datasets, VariViT surpasses vanilla ViTs and ResNet in glioma genotype prediction and brain tumor classification. It achieves F1-scores of 75.5% and 76.3%, respectively, learning more discriminative features. Our proposed batching strategy reduces computation time by up to 30% compared to conventional architectures. These findings underscore the efficacy of VariViT in image representation learning.
KW - Architecture
KW - Representation
KW - Tumor Classification
KW - Vision Transformers
UR - http://www.scopus.com/inward/record.url?scp=85199973424&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85199973424
SN - 2640-3498
VL - 250
SP - 1571
EP - 1583
JO - Proceedings of Machine Learning Research
JF - Proceedings of Machine Learning Research
Y2 - 3 July 2024 through 5 July 2024
ER -