Grounded Latents for Entity-Centric 4D Scene Generation
Abstract
Although recent work has explored generative modeling of 3D and 4D driving scenes, most approaches operate on dense voxel-based representations, which are computationally expensive and make it difficult to maintain temporal and structural consistency. These methods often produce blurred or merged entities (e.g., cars, trucks, pedestrians) and lack fine-grained control over individual scene elements. We propose to perform generative modeling in a compact, entity-centric latent space, where each grounded 3D latent represents a semantically meaningful local region of the scene. This formulation enables precise, consistent control of both foreground and background elements while preserving geometric detail. We further extend this representation to 4D by learning a motion diffusion model for both the ego vehicle and dynamic actors, conditioned on the generated 3D scene, and by propagating the grounded latents through time. Our framework produces physically consistent and temporally coherent 4D scenes, supporting controllable and realistic generation.
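To make the representation concrete, the sketch below illustrates (under stated assumptions, not as the paper's implementation) the core idea of a grounded latent: a latent feature vector anchored to an explicit 3D pose, with 4D scenes obtained by propagating the pose through time while the latent code is reused. All names, shapes, and the constant-velocity update (standing in for the learned motion diffusion model) are hypothetical.

```python
# Minimal, hypothetical sketch of an entity-centric "grounded latent" and its
# temporal propagation. Names, dimensions, and the constant-velocity motion
# model are illustrative assumptions, not the paper's actual implementation.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class GroundedLatent:
    """One entity or background region: a latent code anchored in 3D."""
    latent: np.ndarray   # (d,) feature vector decoded into local geometry/appearance
    center: np.ndarray   # (3,) position of the region in world coordinates
    yaw: float = 0.0     # heading about the up axis
    velocity: np.ndarray = field(default_factory=lambda: np.zeros(3))  # per-step motion

def propagate(latents: list[GroundedLatent], dt: float) -> list[GroundedLatent]:
    """Advance every grounded latent one time step.

    The latent code is reused unchanged, which keeps each entity's identity
    and geometry consistent across frames; only its pose is updated. In the
    paper, per-actor motion would come from the learned motion diffusion
    model; a constant-velocity update stands in here.
    """
    return [
        GroundedLatent(
            latent=g.latent,                    # identity-preserving: code shared across time
            center=g.center + dt * g.velocity,  # pose is what changes between frames
            yaw=g.yaw,
            velocity=g.velocity,
        )
        for g in latents
    ]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    car = GroundedLatent(
        latent=rng.normal(size=64),
        center=np.array([10.0, 2.0, 0.0]),
        velocity=np.array([5.0, 0.0, 0.0]),
    )
    scene = [car]
    for t in range(3):
        scene = propagate(scene, dt=0.1)
        print(f"t={t}: car center = {scene[0].center}")
```

Because each entity owns its own latent and pose, editing or re-posing one actor leaves the rest of the scene untouched, which is the source of the fine-grained control the abstract claims.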