TY - GEN
T1 - D3Net
T2 - 17th European Conference on Computer Vision, ECCV 2022
AU - Chen, Dave Zhenyu
AU - Wu, Qirui
AU - Nießner, Matthias
AU - Chang, Angel X.
N1 - Publisher Copyright:
© 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2022
Y1 - 2022
AB - Recent work on dense captioning and visual grounding in 3D has achieved impressive results. Despite developments in both areas, the limited amount of available 3D vision-language data causes overfitting issues for 3D visual grounding and 3D dense captioning methods. Moreover, how to discriminatively describe objects in complex 3D environments has not yet been fully studied. To address these challenges, we present D3Net, an end-to-end neural speaker-listener architecture that can detect, describe, and discriminate. Our D3Net unifies dense captioning and visual grounding in 3D in a self-critical manner. This self-critical property of D3Net encourages the generation of discriminative object captions and enables semi-supervised training on scan data with partially annotated descriptions. Our method outperforms state-of-the-art (SOTA) methods in both tasks on the ScanRefer dataset, surpassing the SOTA 3D dense captioning method by a significant margin.
UR - http://www.scopus.com/inward/record.url?scp=85144541378&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-19824-3_29
DO - 10.1007/978-3-031-19824-3_29
M3 - Conference contribution
AN - SCOPUS:85144541378
SN - 9783031198236
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 487
EP - 505
BT - Computer Vision – ECCV 2022 – 17th European Conference, Proceedings
A2 - Avidan, Shai
A2 - Brostow, Gabriel
A2 - Cissé, Moustapha
A2 - Farinella, Giovanni Maria
A2 - Hassner, Tal
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 23 October 2022 through 27 October 2022
ER -