

Oral Session

Oral Session 6A: 3D from Single or Multi-View Sensors

Sun 15 Jun 11 a.m. PDT — 12:30 p.m. PDT

Sun 15 June 11:00 - 11:15 PDT

Award Candidate
DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models

Jay Zhangjie Wu · Yuxuan Zhang · Haithem Turki · Xuanchi Ren · Jun Gao · Mike Zheng Shou · Sanja Fidler · Žan Gojčič · Huan Ling

Neural Radiance Fields and 3D Gaussian Splatting have revolutionized 3D reconstruction and novel-view synthesis. However, achieving photorealistic rendering from extreme novel viewpoints remains challenging, as artifacts persist across representations. In this work, we introduce Difix3D+, a novel pipeline designed to enhance 3D reconstruction and novel-view synthesis through single-step diffusion models. At the core of our approach is Difix, a single-step image diffusion model trained to enhance and remove artifacts in rendered novel views caused by underconstrained regions of the 3D representation. Difix serves two critical roles in our pipeline. First, it is used during the reconstruction phase to clean up pseudo-training views that are rendered from the reconstruction and then distilled back into 3D. This greatly enhances underconstrained regions and improves the overall 3D representation quality. More importantly, Difix also acts as a neural enhancer during inference, effectively removing residual artifacts arising from imperfect 3D supervision and the limited capacity of current reconstruction models. Difix3D+ is a general solution: a single model compatible with both NeRF and 3DGS representations, achieving an average 2x improvement in FID score over baselines while maintaining 3D consistency.
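The abstract describes two uses of the same single-step model: cleaning pseudo-views during reconstruction and enhancing rendered frames at inference. A minimal sketch of that control flow, with every function a hypothetical stand-in rather than the authors' code, might look like this:

```python
# Sketch of the two-phase use of a single-step "cleaner" model described above.
# render / difix / finetune are illustrative stand-ins, not the released pipeline.
from typing import Callable, List
import numpy as np

Image = np.ndarray  # H x W x 3 float image


def progressive_distillation(
    render: Callable[[np.ndarray], Image],      # renders the current 3D model at a camera pose
    difix: Callable[[Image], Image],            # single-step diffusion cleaner (stand-in)
    finetune: Callable[[List[Image], List[np.ndarray]], None],  # distills cleaned views back into 3D
    novel_poses: List[np.ndarray],
    rounds: int = 3,
) -> None:
    """Reconstruction phase: render pseudo-training views, clean them, distill them back."""
    for _ in range(rounds):
        pseudo_views = [difix(render(pose)) for pose in novel_poses]
        finetune(pseudo_views, novel_poses)     # cleaned views supervise underconstrained regions


def render_novel_view(
    render: Callable[[np.ndarray], Image],
    difix: Callable[[Image], Image],
    pose: np.ndarray,
) -> Image:
    """Inference phase: the same model acts as a neural enhancer on each rendered frame."""
    return difix(render(pose))
```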

Sun 15 June 11:15 - 11:30 PDT

3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting

Qi Wu · Janick Martinez Esturo · Ashkan Mirzaei · Nicolas Moënne-Loccoz · Žan Gojčič

3D Gaussian Splatting (3DGS) has shown great potential for efficient reconstruction and high-fidelity real-time rendering of complex scenes on consumer hardware. However, due to its rasterization-based formulation, 3DGS is constrained to ideal pinhole cameras and lacks support for secondary lighting effects. Recent methods address these limitations by tracing volumetric particles instead; however, this comes at the cost of significantly slower rendering. In this work, we propose 3D Gaussian Unscented Transform (3DGUT), which replaces the EWA splatting formulation in 3DGS with the Unscented Transform, approximating each particle by sigma points that can be projected exactly under any nonlinear projection function. This modification enables straightforward support for distorted cameras with time-dependent effects such as rolling shutter, while retaining the efficiency of rasterization. Additionally, we align our rendering formulation with that of tracing-based methods, enabling the secondary ray tracing required to represent phenomena such as reflection and refraction within the same 3D representation.
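The core computational idea, projecting a 3D Gaussian through a nonlinear camera model via sigma points instead of a linearized EWA splat, can be sketched as follows. This is a minimal illustration of the standard unscented transform; the equidistant fisheye projection and parameter choices are assumptions for the example, not the paper's exact formulation:

```python
# Sketch: splatting a 3D Gaussian through a nonlinear camera with the unscented transform.
import numpy as np


def sigma_points(mean: np.ndarray, cov: np.ndarray, kappa: float = 1.0):
    """Standard unscented-transform sigma points and weights for an n-dimensional Gaussian."""
    n = mean.shape[0]
    L = np.linalg.cholesky((n + kappa) * cov)
    pts = [mean] + [mean + L[:, i] for i in range(n)] + [mean - L[:, i] for i in range(n)]
    weights = np.array([kappa / (n + kappa)] + [1.0 / (2.0 * (n + kappa))] * (2 * n))
    return np.stack(pts), weights


def project_equidistant(x: np.ndarray) -> np.ndarray:
    """Example nonlinear projection (equidistant fisheye), applied exactly per point."""
    theta = np.arctan2(np.linalg.norm(x[:2]), x[2])   # angle from the optical axis
    phi = np.arctan2(x[1], x[0])
    return theta * np.array([np.cos(phi), np.sin(phi)])


def splat_gaussian(mean3d: np.ndarray, cov3d: np.ndarray):
    """Projected 2D mean and covariance of a 3D Gaussian particle via its sigma points."""
    pts, w = sigma_points(mean3d, cov3d)
    proj = np.stack([project_equidistant(p) for p in pts])
    mean2d = (w[:, None] * proj).sum(axis=0)
    diff = proj - mean2d
    cov2d = (w[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0)
    return mean2d, cov2d


mu = np.array([0.3, -0.2, 2.0])
Sigma = np.diag([0.02, 0.02, 0.05])
print(splat_gaussian(mu, Sigma))
```

Because each sigma point is pushed through the exact projection function, the same machinery accommodates distortion or rolling-shutter-style pose interpolation without changing the rasterizer.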

Sun 15 June 11:30 - 11:45 PDT

DNF: Unconditional 4D Generation with Dictionary-based Neural Fields

Xinyi Zhang · Naiqi Li · Angela Dai

While remarkable success has been achieved with diffusion-based 3D generative models for shapes, 4D generative modeling remains challenging due to the complexity of object deformations over time. We propose DNF, a new 4D representation for unconditional generative modeling that efficiently models deformable shapes with disentangled shape and motion while capturing high-fidelity details in the deforming objects. To achieve this, we propose a dictionary learning approach that disentangles 4D motion from shape as neural fields. Both shape and motion are represented as learned latent spaces, where each deformable shape is represented by its shape and motion global latent codes, shape-specific coefficient vectors, and shared dictionary information. This captures both shape-specific detail and global shared information in the learned dictionary. Our dictionary-based representation strikes a good balance between fidelity, continuity, and compression; combined with a transformer-based diffusion model, our method generates effective, high-fidelity 4D animations.
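A toy sketch of the dictionary idea follows: a dictionary of latent atoms is shared across the dataset, each shape carries its own coefficient vector, and their product conditions a neural field. The decoder architecture and sizes below are illustrative assumptions, not the paper's configuration:

```python
# Minimal sketch of a dictionary-based neural field: shared atoms + per-shape coefficients.
import torch
import torch.nn as nn


class DictionaryField(nn.Module):
    def __init__(self, num_shapes: int, num_atoms: int = 64, latent_dim: int = 128):
        super().__init__()
        # Shared dictionary of latent atoms (global information across the dataset).
        self.dictionary = nn.Parameter(torch.randn(num_atoms, latent_dim) * 0.01)
        # Shape-specific coefficient vectors (per-instance detail).
        self.coeffs = nn.Embedding(num_shapes, num_atoms)
        # Toy neural field: maps (xyz, latent) -> signed distance.
        self.decoder = nn.Sequential(
            nn.Linear(3 + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, shape_idx: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
        latent = self.coeffs(shape_idx) @ self.dictionary          # (B, latent_dim)
        latent = latent.unsqueeze(1).expand(-1, xyz.shape[1], -1)  # broadcast to all query points
        return self.decoder(torch.cat([xyz, latent], dim=-1))      # (B, N, 1) signed distance


field = DictionaryField(num_shapes=100)
sdf = field(torch.tensor([0, 1]), torch.rand(2, 1024, 3))
print(sdf.shape)  # torch.Size([2, 1024, 1])
```

An analogous field with its own dictionary and coefficients could model the motion (deformation) component, keeping shape and motion latents disentangled as the abstract describes.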

Sun 15 June 11:45 - 12:00 PDT

CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models

Rundi Wu · Ruiqi Gao · Ben Poole · Alex Trevithick · Changxi Zheng · Jonathan T. Barron · Aleksander Holynski

We present CAT4D, a method for creating 4D (dynamic 3D) scenes from monocular video. CAT4D leverages a multi-view video diffusion model trained on a diverse combination of datasets to enable novel view synthesis at any specified camera poses and timestamps. Combined with a novel sampling approach, this model can transform a single monocular video into a multi-view video, enabling robust 4D reconstruction via optimization of a deformable 3D Gaussian representation. We demonstrate competitive performance on novel view synthesis and dynamic scene reconstruction benchmarks, and highlight the creative capabilities for 4D scene generation from real or generated videos.
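At a high level, the abstract describes completing a grid of frames indexed by camera pose and timestamp from a single monocular video, then fitting a deformable 3D Gaussian representation to that grid. A heavily hedged sketch of the grid completion, with the diffusion sampler as a stand-in and no claim about the paper's actual sampling strategy, is:

```python
# Sketch of the view x time grid implied by the abstract: the input video provides one
# camera per timestamp; a multi-view video diffusion sampler (stand-in) fills the rest.
from typing import Callable, List
import numpy as np

Image = np.ndarray  # H x W x 3


def complete_view_time_grid(
    input_video: List[Image],                 # frame t was captured from the input trajectory
    target_poses: List[np.ndarray],           # desired novel camera poses
    sample_frame: Callable[[List[Image], np.ndarray, int], Image],  # diffusion sampler (stand-in)
) -> List[List[Image]]:
    grid = []
    for pose in target_poses:
        row = [sample_frame(input_video, pose, t) for t in range(len(input_video))]
        grid.append(row)                      # one synthesized video per target viewpoint
    return grid                               # supervises the deformable 3D Gaussian fit
```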

Sun 15 June 12:00 - 12:15 PDT

Diffusion Renderer: Neural Inverse and Forward Rendering with Video Diffusion Models

Ruofan Liang · Žan Gojčič · Huan Ling · Jacob Munkberg · Jon Hasselgren · Chih-Hao Lin · Jun Gao · Alexander Keller · Nandita Vijaykumar · Sanja Fidler · Zian Wang

Understanding and modeling lighting effects are fundamental tasks in computer vision and graphics. Classic physically-based rendering (PBR) accurately simulates light transport, but relies on precise scene representations -- explicit 3D geometry, high-quality material properties, and lighting conditions -- that are often impractical to obtain in real-world scenarios. Therefore, we introduce Diffusion Renderer, a neural approach that addresses the dual problem of inverse and forward rendering within a holistic framework. Leveraging powerful video diffusion model priors, the inverse rendering model accurately estimates G-buffers from real-world videos, providing an interface for image editing tasks and training data for the rendering model. Conversely, our rendering model generates photorealistic images from G-buffers without explicit light transport simulation. Specifically, we first train a video diffusion model for inverse rendering on synthetic data, which generalizes well to real-world videos and allows us to auto-label diverse real-world videos. We then co-train our rendering model using both synthetic and auto-labeled real-world data. Experiments demonstrate that Diffusion Renderer effectively approximates inverse and forward rendering, consistently outperforming the state of the art. Our model enables practical applications from a single video input, including relighting, material editing, and realistic object insertion.
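The training and inference data flow described above can be sketched as follows; all models and buffer names are stand-in assumptions for illustration, not the released system:

```python
# Sketch of the data flow: an inverse model (video -> G-buffers) trained on synthetic data
# auto-labels real videos, and the forward model (G-buffers + lighting -> video) is
# co-trained on both. All callables are hypothetical stand-ins.
from typing import Callable, Dict, List, Tuple
import numpy as np

Video = np.ndarray                 # T x H x W x 3
GBuffers = Dict[str, np.ndarray]   # e.g. {"albedo": ..., "normal": ..., "roughness": ...}


def build_cotraining_set(
    synthetic: List[Tuple[Video, GBuffers]],     # videos with ground-truth G-buffers
    real_videos: List[Video],
    inverse_model: Callable[[Video], GBuffers],  # trained on synthetic data first
) -> List[Tuple[Video, GBuffers]]:
    auto_labeled = [(v, inverse_model(v)) for v in real_videos]  # pseudo G-buffer labels
    return synthetic + auto_labeled              # co-training data for the forward renderer


def relight(
    video: Video,
    new_lighting: np.ndarray,
    inverse_model: Callable[[Video], GBuffers],
    forward_model: Callable[[GBuffers, np.ndarray], Video],
) -> Video:
    """Relighting from a single video: estimate G-buffers, then re-render under new lighting."""
    return forward_model(inverse_model(video), new_lighting)
```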