D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding

Dave Zhenyu Chen, Qirui Wu, Matthias Nießner, Angel X. Chang

Publication: Chapter in book/report/conference proceedings, conference contribution, peer-reviewed

11 citations (Scopus)

Abstract

Recent work on dense captioning and visual grounding in 3D has achieved impressive results. Despite developments in both areas, the limited amount of available 3D vision-language data causes overfitting issues for 3D visual grounding and 3D dense captioning methods. Also, how to discriminatively describe objects in complex 3D environments has not yet been fully studied. To address these challenges, we present D3Net, an end-to-end neural speaker-listener architecture that can detect, describe and discriminate. Our D3Net unifies dense captioning and visual grounding in 3D in a self-critical manner. This self-critical property of D3Net encourages generation of discriminative object captions and enables semi-supervised training on scan data with partially annotated descriptions. Our method outperforms SOTA methods in both tasks on the ScanRefer dataset, surpassing the SOTA 3D dense captioning method by a significant margin.
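
The "self-critical" coupling of speaker and listener mentioned in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: it assumes a generic self-critical policy-gradient setup in which a sampled caption is rewarded by how confidently a listener picks the described object, and the reward of a greedily decoded caption serves as the baseline. All function names, scores, and the log-probability value below are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def listener_reward(match_scores, target_index):
    """Probability a (hypothetical) listener assigns to the described object.

    match_scores holds one caption-to-object matching score per detected
    object; target_index marks the object the caption was generated for.
    """
    return softmax(match_scores)[target_index]


def self_critical_loss(sample_log_prob, sampled_reward, baseline_reward):
    """Policy-gradient loss with the greedy caption's reward as baseline.

    Minimizing this raises the log-probability of sampled captions that the
    listener grounds better than the greedy caption, and lowers it otherwise.
    """
    advantage = sampled_reward - baseline_reward
    return -advantage * sample_log_prob


# Toy scene with three detected objects; object 0 is the one being described.
sampled_scores = [2.0, 0.0, 0.0]  # assumed listener scores for a sampled caption
greedy_scores = [0.0, 0.0, 0.0]   # assumed listener scores for the greedy caption

r_sampled = listener_reward(sampled_scores, target_index=0)
r_greedy = listener_reward(greedy_scores, target_index=0)

# Assumed speaker log-probability of the sampled caption.
loss = self_critical_loss(-1.5, r_sampled, r_greedy)
print(f"sampled reward={r_sampled:.3f}, greedy reward={r_greedy:.3f}, loss={loss:.3f}")
```

In this toy case the sampled caption singles out the target better than the greedy one, so the advantage is positive and minimizing the loss makes that caption more likely. A reward of this kind needs no ground-truth caption, which is one plausible reading of how training on partially annotated scans becomes possible.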

Original language: English
Title: Computer Vision – ECCV 2022 - 17th European Conference, Proceedings
Editors: Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, Tal Hassner
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 487-505
Number of pages: 19
ISBN (Print): 9783031198236
DOIs
Publication status: Published - 2022
Event: 17th European Conference on Computer Vision, ECCV 2022 - Tel Aviv, Israel
Duration: 23 Oct 2022 to 27 Oct 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13692 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 17th European Conference on Computer Vision, ECCV 2022
Country/Territory: Israel
City: Tel Aviv
Period: 23/10/22 to 27/10/22
