TY - GEN
T1 - ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
AU - Höllein, Lukas
AU - Božić, Aljaž
AU - Müller, Norman
AU - Novotny, David
AU - Tseng, Hung-Yu
AU - Richardt, Christian
AU - Zollhöfer, Michael
AU - Nießner, Matthias
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
AB - 3D asset generation is receiving massive amounts of attention, inspired by the recent success of text-guided 2D content creation. Existing text-to-3D methods use pretrained text-to-image diffusion models in an optimization problem or fine-tune them on synthetic data, which often results in non-photorealistic 3D objects without backgrounds. In this paper, we present a method that leverages pretrained text-to-image models as a prior and learns to generate multi-view images in a single denoising process from real-world data. Concretely, we propose to integrate 3D volume-rendering and cross-frame-attention layers into each block of the existing U-Net network of the text-to-image model. Moreover, we design an autoregressive generation scheme that renders more 3D-consistent images at any viewpoint. We train our model on real-world datasets of objects and showcase its capability to generate instances with a variety of high-quality shapes and textures in authentic surroundings. Compared to existing methods, the results generated by our method are consistent and have favorable visual quality (-30% FID, -37% KID).
KW - 3D generation
KW - diffusion models
KW - multi-view generation
KW - text-to-3D
UR - http://www.scopus.com/inward/record.url?scp=85192157366&partnerID=8YFLogxK
U2 - 10.1109/CVPR52733.2024.00482
DO - 10.1109/CVPR52733.2024.00482
M3 - Conference contribution
AN - SCOPUS:85192157366
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 5043
EP - 5052
BT - Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
PB - IEEE Computer Society
Y2 - 16 June 2024 through 22 June 2024
ER -