Extend3D: Town-Scale 3D Generation
Seungwoo Yoon ⋅ Jinmo Kim ⋅ Jaesik Park
Abstract
In this paper, we propose Extend3D, a novel training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the fixed-size latent space of object-centric models, which limits their ability to represent wide scenes, we extend the latent space in the $x$ and $y$ directions. We then divide the extended latent into overlapping patches, apply the object-centric 3D generative model to each patch, and couple the patches at each denoising time step. Since object-centric models are sub-optimal for sub-scene generation, we use the input image and a point cloud extracted by a depth estimator as priors to guide this process. With the point cloud prior, we initialize the scene structure and iteratively refine occluded regions with under-noised SDEdit. In addition, both priors are used to optimize the extended latent during denoising so that the denoising paths do not deviate from the sub-scene dynamics. Human preference studies show that our method produces better results than previous methods.
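The abstract's patch-coupling idea can be illustrated with a minimal NumPy sketch. The paper does not specify the coupling mechanism, so this sketch assumes a simple MultiDiffusion-style scheme: each overlapping patch is denoised independently by a fixed-size model, and overlapping outputs are averaged at every time step. The `denoise_step` function here is a hypothetical placeholder for the object-centric model, not the authors' implementation.

```python
import numpy as np

def denoise_step(patch, t):
    # Hypothetical stand-in for one reverse-diffusion step of the
    # fixed-size object-centric model on a single latent patch.
    return patch * 0.9

def coupled_step(latent, patch_size, stride, t):
    """One denoising step over an extended latent: run the fixed-size
    model on overlapping patches, then average overlapping outputs so
    the patches stay mutually consistent (assumed coupling scheme)."""
    H, W, C = latent.shape
    out = np.zeros_like(latent)
    count = np.zeros((H, W, 1))
    for y in range(0, H - patch_size + 1, stride):
        for x in range(0, W - patch_size + 1, stride):
            patch = latent[y:y + patch_size, x:x + patch_size]
            out[y:y + patch_size, x:x + patch_size] += denoise_step(patch, t)
            count[y:y + patch_size, x:x + patch_size] += 1
    return out / count  # average where patches overlap

# Extended latent: wider than one 8x8 model patch in x and y.
latent = np.ones((16, 16, 4))
stepped = coupled_step(latent, patch_size=8, stride=4, t=0)
print(stepped.shape)  # (16, 16, 4)
```

In practice the patches would be latent tensors of a 3D generative model and `denoise_step` a learned network; the averaging in overlaps is what "couples" the per-patch denoising trajectories into one coherent extended scene.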