
Poster

Robust Multi-Object 4D Generation for In-the-wild Videos

Wen-Hsuan Chu · Lei Ke · Jianmeng Liu · Mingxiao Huo · Pavel Tokmakov · Katerina Fragkiadaki


Abstract:

We address the challenging problem of generating a dynamic 4D scene, across views and over time, from monocular videos. We target in-the-wild multi-object videos with heavy occlusions and propose Robust4DGen, a model that decomposes the scene into object tracks and optimizes a differentiable, deformable set of 3D Gaussians for each. Robust4DGen captures 2D occlusions from a 3D perspective by jointly splatting the Gaussians of all objects to compute rendering errors in the observed frames. Rather than relying on scene-level view-generation models, which struggle to generalize due to the combinatorial complexity of scene views, we retain the Gaussian grouping information and additionally use object-centric, view-conditioned generative models for each entity to optimize score distillation objectives from unobserved viewpoints. Differentiable affine transformations let us jointly optimize the global image re-projection and the object-centric score distillation objectives within a unified framework. To enable a thorough evaluation of generation and motion accuracy under multi-object occlusions, we annotate MOSE-PTS, a subset of the challenging MOSE video segmentation benchmark, with accurate 2D point tracks. Through quantitative analysis and human evaluation, we demonstrate that our method generates more realistic 4D multi-object scenes and produces more accurate point tracks across spatial and temporal dimensions than existing approaches.
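The abstract describes a single optimization that mixes two signals: a global photometric loss from jointly splatting all objects (so inter-object occlusions produce correct gradients) and a per-object score-distillation loss from unobserved viewpoints, coupled through differentiable affine transforms. Below is a minimal, self-contained PyTorch sketch of that structure. It is a toy illustration, not the authors' implementation: the 2D splatting function, the placeholder score-distillation gradient, and all names (splat, fake_sds_grad, affine_A, affine_t) are assumptions, whereas Robust4DGen operates on deformable 3D Gaussians and a real view-conditioned diffusion model.

```python
import torch

def splat(means2d, colors, opacities, H=32, W=32, sigma=1.5):
    """Soft-rasterize 2D points into an H x W RGB image.

    A drastically simplified, differentiable stand-in for Gaussian
    splatting: each point contributes an isotropic Gaussian kernel.
    """
    ys = torch.arange(H, dtype=torch.float32).view(H, 1, 1)
    xs = torch.arange(W, dtype=torch.float32).view(1, W, 1)
    dy = ys - means2d[:, 1].view(1, 1, -1)               # (H, 1, N)
    dx = xs - means2d[:, 0].view(1, 1, -1)               # (1, W, N)
    w = opacities.view(1, 1, -1) * torch.exp(-(dx**2 + dy**2) / (2 * sigma**2))
    img = (w.unsqueeze(-1) * colors.view(1, 1, -1, 3)).sum(dim=2)
    return img.clamp(0, 1)

def fake_sds_grad(rendered):
    """Placeholder for a score-distillation gradient from a view-conditioned
    generative model; here it just pulls renders toward mid-gray so the
    example runs end to end."""
    return rendered - 0.5

# Per-object state: canonical Gaussians plus a differentiable affine transform
# (the paper's grouping: one Gaussian set per object track).
torch.manual_seed(0)
num_objects, pts_per_obj = 2, 50
objects = []
for _ in range(num_objects):
    objects.append({
        "means":    torch.randn(pts_per_obj, 2, requires_grad=True),  # toy 2D; the paper uses 3D
        "colors":   torch.rand(pts_per_obj, 3, requires_grad=True),
        "opacity":  torch.full((pts_per_obj,), 0.5, requires_grad=True),
        "affine_A": torch.eye(2).requires_grad_(True),                # linear part
        "affine_t": torch.tensor([16.0, 16.0], requires_grad=True),   # translation
    })

params = [p for o in objects for p in o.values()]
opt = torch.optim.Adam(params, lr=1e-2)
target = torch.rand(32, 32, 3)  # stands in for an observed video frame

for step in range(100):
    opt.zero_grad()

    # 1) Joint splat of ALL objects -> global image re-projection loss.
    #    Rendering everything together is what makes occlusions between
    #    objects contribute the right gradients.
    all_means, all_colors, all_opac = [], [], []
    for o in objects:
        m = o["means"] @ o["affine_A"].T + o["affine_t"]  # differentiable affine
        all_means.append(m)
        all_colors.append(o["colors"])
        all_opac.append(o["opacity"])
    joint = splat(torch.cat(all_means), torch.cat(all_colors), torch.cat(all_opac))
    loss = ((joint - target) ** 2).mean()

    # 2) Per-object renders -> SDS-style loss on each entity separately.
    #    Keeping the grouping lets each object be scored on its own from
    #    unobserved viewpoints.
    for o in objects:
        m = o["means"] @ o["affine_A"].T + o["affine_t"]
        obj_img = splat(m, o["colors"], o["opacity"])
        loss = loss + 0.1 * (fake_sds_grad(obj_img).detach() * obj_img).mean()

    loss.backward()
    opt.step()
```

The detached-gradient product in step 2 is the standard surrogate for score distillation: its gradient with respect to the render equals the (stopped-gradient) score term, so both objectives flow into the shared Gaussian and affine parameters in one backward pass.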
