TY - JOUR
T1 - Boosting Self-Supervision for Single-View Scene Completion via Knowledge Distillation
AU - Han, Keonhee
AU - Muhle, Dominik
AU - Wimbauer, Felix
AU - Cremers, Daniel
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Inferring scene geometry from images via Structure from Motion is a long-standing and fundamental problem in computer vision. While classical approaches and, more recently, depth map predictions only focus on the visible parts of a scene, the task of scene completion aims to reason about geometry even in occluded regions. With the popularity of neural radiance fields (NeRFs), implicit representations also became popular for scene completion by predicting so-called density fields. Unlike explicit approaches, e.g. voxel-based methods, density fields also allow for accurate depth prediction and novel-view synthesis via image-based rendering. In this work, we propose to fuse the scene reconstruction from multiple images and distill this knowledge into a more accurate single-view scene reconstruction. To this end, we propose Multi-View Behind the Scenes (MVBTS) to fuse density fields from multiple posed images, trained fully self-supervised only from image data. Using knowledge distillation, we use MVBTS to train a single-view scene completion network, called KDBTS, via direct supervision. It achieves state-of-the-art performance on occupancy prediction, especially in occluded regions.
AB - Inferring scene geometry from images via Structure from Motion is a long-standing and fundamental problem in computer vision. While classical approaches and, more recently, depth map predictions only focus on the visible parts of a scene, the task of scene completion aims to reason about geometry even in occluded regions. With the popularity of neural radiance fields (NeRFs), implicit representations also became popular for scene completion by predicting so-called density fields. Unlike explicit approaches, e.g. voxel-based methods, density fields also allow for accurate depth prediction and novel-view synthesis via image-based rendering. In this work, we propose to fuse the scene reconstruction from multiple images and distill this knowledge into a more accurate single-view scene reconstruction. To this end, we propose Multi-View Behind the Scenes (MVBTS) to fuse density fields from multiple posed images, trained fully self-supervised only from image data. Using knowledge distillation, we use MVBTS to train a single-view scene completion network, called KDBTS, via direct supervision. It achieves state-of-the-art performance on occupancy prediction, especially in occluded regions.
KW - Depth Estimation
KW - Single-View-Reconstruction
UR - http://www.scopus.com/inward/record.url?scp=85213353699&partnerID=8YFLogxK
U2 - 10.1109/CVPR52733.2024.00939
DO - 10.1109/CVPR52733.2024.00939
M3 - Conference article
AN - SCOPUS:85213353699
SN - 1063-6919
SP - 9837
EP - 9847
JO - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
JF - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Y2 - 16 June 2024 through 22 June 2024
ER -