D³Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding

Dave Zhenyu Chen, Qirui Wu, Matthias Nießner, Angel X. Chang

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

11 Scopus citations

Abstract

Recent work on dense captioning and visual grounding in 3D has achieved impressive results. Despite developments in both areas, the limited amount of available 3D vision-language data causes overfitting issues for 3D visual grounding and 3D dense captioning methods. Also, how to discriminatively describe objects in complex 3D environments has not yet been fully studied. To address these challenges, we present D³Net, an end-to-end neural speaker-listener architecture that can detect, describe and discriminate. Our D³Net unifies dense captioning and visual grounding in 3D in a self-critical manner. This self-critical property of D³Net encourages generation of discriminative object captions and enables semi-supervised training on scan data with partially annotated descriptions. Our method outperforms SOTA methods in both tasks on the ScanRefer dataset, surpassing the SOTA 3D dense captioning method by a significant margin.

Original language: English
Title of host publication: Computer Vision – ECCV 2022 - 17th European Conference, Proceedings
Editors: Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, Tal Hassner
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 487-505
Number of pages: 19
ISBN (Print): 9783031198236
DOIs
State: Published - 2022
Event: 17th European Conference on Computer Vision, ECCV 2022 - Tel Aviv, Israel
Duration: 23 Oct 2022 to 27 Oct 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13692 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 17th European Conference on Computer Vision, ECCV 2022
Country/Territory: Israel
City: Tel Aviv
Period: 23/10/22 to 27/10/22
