
Multilingual Augmentation for Robust Visual Question Answering in Remote Sensing Images

  • Technical University of Munich

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

17 Scopus citations

Abstract

Visual question answering for remote sensing data (RSVQA), which aims to answer questions based on the content of remotely sensed images, has attracted much attention recently. However, previous work has paid little attention to the robustness of RSVQA models. To enhance their reliability, the key challenge is learning representations that are robust to new words and to different question templates with the same meaning. The proposed augmented dataset provides additional questions that have the same meaning as the original ones. To make better use of this information, we propose a contrastive learning strategy for training RSVQA models that are robust to diverse question templates and words. Experimental results demonstrate that the proposed augmented dataset is effective in improving the robustness of the RSVQA model. In addition, the contrastive learning strategy performs well on the low-resolution (LR) dataset.

Original language: English
Title of host publication: 2023 Joint Urban Remote Sensing Event, JURSE 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781665493734
DOIs
State: Published - 2023
Event: 2023 Joint Urban Remote Sensing Event, JURSE 2023 - Heraklion, Greece
Duration: 17 May 2023 - 19 May 2023

Publication series

Name: 2023 Joint Urban Remote Sensing Event, JURSE 2023

Conference

Conference: 2023 Joint Urban Remote Sensing Event, JURSE 2023
Country/Territory: Greece
City: Heraklion
Period: 17/05/23 - 19/05/23

Keywords

  • Remote sensing
  • deep learning
  • robustness
  • visual question answering (VQA)
