Transformers Pay Attention to Convolutions Leveraging Emerging Properties of ViTs by Dual Attention-Image Network

Yousef Yeganeh, Azade Farshad, Peter Weinberger, Seyed Ahmad Ahmadi, Ehsan Adeli, Nassir Navab

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Scopus citations

Abstract

Although purely transformer-based architectures pretrained on large datasets have been introduced as foundation models for general computer vision tasks, hybrid models that combine convolution and transformer blocks have shown state-of-the-art performance in more specialized tasks. Nevertheless, despite the performance gains of both pure and hybrid transformer-based architectures over convolutional networks, their high training cost and complexity make them challenging to use in real scenarios. In this work, we propose a novel and simple architecture based only on convolutional layers and show that, by taking advantage of the attention map visualizations obtained from a self-supervised pretrained vision transformer network, it outperforms complex transformer-based networks, and even 3D architectures, at a much lower computational cost. The proposed architecture is composed of two encoder branches: one takes the original image as input, and the other takes the attention map visualizations of the same image from multiple self-attention heads of a pretrained DINO model. The results of our experiments on medical imaging datasets show that the attention map visualizations extracted from the attention heads of a pretrained transformer architecture, combined with the image, provide strong prior knowledge that enables a pure CNN architecture to outperform CNN-based and transformer-based architectures. Project Page: dai-net.github.io
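The abstract does not specify how the attention map visualizations are produced or how the two branches are fused. As a rough illustration only, the sketch below shows one way the input to the attention branch might be assembled. It assumes the common practice for visualizing DINO attention: take each head's attention from the CLS token to the patch tokens, reshape it to the patch grid, and upsample it to image resolution with nearest-neighbor interpolation. The function names (`cls_attention_to_maps`, `upsample_nearest`, `build_attention_branch_input`) and the plain nested-list representation are hypothetical and are not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def cls_attention_to_maps(cls_attn: Matrix, grid_h: int, grid_w: int) -> List[Matrix]:
    """Reshape per-head CLS-token attention into 2D patch-grid maps.

    cls_attn[h] is the attention of the CLS token over all tokens for head h.
    Index 0 is the CLS token itself; indices 1.. are patches in row-major order.
    (This layout is an assumption, not something the abstract states.)
    """
    maps = []
    for head in cls_attn:
        patches = head[1:]  # drop the CLS-to-CLS weight
        if len(patches) != grid_h * grid_w:
            raise ValueError("token count does not match the patch grid")
        maps.append([patches[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)])
    return maps


def upsample_nearest(m: Matrix, factor: int) -> Matrix:
    """Nearest-neighbor upsampling: repeat every value `factor` times per axis."""
    out = []
    for row in m:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def build_attention_branch_input(
    cls_attn: Matrix, grid_h: int, grid_w: int, patch_size: int
) -> List[Matrix]:
    """Return one image-resolution attention map per head (a multi-channel input)."""
    maps = cls_attention_to_maps(cls_attn, grid_h, grid_w)
    return [upsample_nearest(m, patch_size) for m in maps]


# Toy example: 2 heads, a 2x2 patch grid, patch size 2, so a 4x4 "image".
toy_attn = [
    [0.2, 0.1, 0.3, 0.2, 0.2],
    [0.0, 0.4, 0.1, 0.1, 0.4],
]
channels = build_attention_branch_input(toy_attn, grid_h=2, grid_w=2, patch_size=2)
print(len(channels), len(channels[0]), len(channels[0][0]))
```

Under these assumptions, a 224x224 image split into 16x16 patches would give a 14x14 grid per head, and each head would contribute one channel to the second encoder branch while the first branch receives the original image.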

Original language: English
Title of host publication: Proceedings - 2023 IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 2296-2307
Number of pages: 12
ISBN (Electronic): 9798350307443
State: Published - 2023
Event: 2023 IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2023 - Paris, France
Duration: 2 Oct 2023 - 6 Oct 2023

Publication series

Name: Proceedings - 2023 IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2023

Conference

Conference: 2023 IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2023
Country/Territory: France
City: Paris
Period: 2/10/23 - 6/10/23

Keywords

  • Attention Map
  • Medical Imaging
  • Segmentation
  • Transformers
