Strong but Simple: A Baseline for Domain Generalized Dense Perception by CLIP-Based Transfer Learning

Christoph Hümmer, Manuel Schwonberg, Liangwei Zhou, Hu Cao, Alois Knoll, Hanno Gottschalk

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Domain generalization (DG) remains a significant challenge for perception based on deep neural networks (DNNs), where domain shifts occur due to synthetic data, lighting, weather, or location changes. Vision-language models (VLMs) marked a large step forward in generalization capabilities and have already been applied to various tasks. Very recently, the first approaches utilized VLMs for domain generalized segmentation and object detection and obtained strong generalization. However, all these approaches rely on complex modules, feature augmentation frameworks, or additional models. Surprisingly, and in contrast to that, we found that simple fine-tuning of vision-language pre-trained models yields competitive or even stronger generalization results while being extremely simple to apply. Moreover, we found that vision-language pre-training consistently provides better generalization than the previous standard of vision-only pre-training. This challenges the standard of using ImageNet-based transfer learning for domain generalization. Fully fine-tuning a vision-language pre-trained model is capable of reaching the domain generalization SOTA when training on the synthetic GTA5 dataset. Moreover, we confirm this observation for object detection on a novel synthetic-to-real benchmark. We further obtain superior generalization capabilities by reaching 77.9% mIoU on the popular Cityscapes→ACDC benchmark. We also found improved in-domain generalization, leading to an improved SOTA of 86.4% mIoU on the Cityscapes test set, marking the first place on the leaderboard.
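
The abstract reports its segmentation results as mIoU (mean intersection-over-union). As a reading aid, the following is a minimal sketch of how that metric is commonly computed from per-pixel class labels; it is not taken from the paper. The function name `mean_iou`, the flat label lists, and the `ignore_index=255` default are assumptions made for illustration, and benchmarks usually accumulate the counts over the whole evaluation set rather than per image.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection-over-union over the classes that occur.

    pred, target: flat sequences of integer class labels, one per pixel.
    Predictions are assumed to lie in [0, num_classes).
    ignore_index: label marking unlabeled pixels (hypothetical default).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixels do not count for any class
        if p == t:
            # Correct pixel: belongs to both the intersection and the union.
            intersection[t] += 1
            union[t] += 1
        else:
            # Wrong pixel: enlarges the union of the predicted class
            # and of the ground-truth class.
            union[p] += 1
            union[t] += 1
    # Average only over classes that appear in prediction or ground truth.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


if __name__ == "__main__":
    pred = [0, 0, 1, 1]
    target = [0, 1, 1, 255]
    print(mean_iou(pred, target, num_classes=2))
```

In this toy case each of the two classes has one correctly labeled pixel and a union of two pixels, so the score is 0.5, which would be reported as 50% mIoU.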

Original language: English
Title of host publication: Computer Vision – ACCV 2024 - 17th Asian Conference on Computer Vision, Proceedings
Editors: Minsu Cho, Ivan Laptev, Du Tran, Angela Yao, Hongbin Zha
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 463-484
Number of pages: 22
ISBN (Print): 9789819609710
DOIs
State: Published - 2025
Event: 17th Asian Conference on Computer Vision, ACCV 2024 - Hanoi, Viet Nam
Duration: 8 Dec 2024 - 12 Dec 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 15481 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 17th Asian Conference on Computer Vision, ACCV 2024
Country/Territory: Viet Nam
City: Hanoi
Period: 8/12/24 - 12/12/24

Keywords

  • Domain Generalization
  • Object Detection
  • Semantic Segmentation
