3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image
Abstract
We introduce 3D-Fixer, a novel, generalizable, and efficient scheme for single-image compositional 3D scene generation. Existing feed-forward frameworks lack generalization ability in open-set scenarios due to limited datasets, while divide-and-conquer frameworks suffer from slow inference or accumulated registration errors during layout alignment. In contrast, 3D-Fixer extends pre-trained object-level 3D generation priors to perform in-place completion on single-view estimated geometry, eliminating the need for pose alignment while preserving feed-forward efficiency. At its core, 3D-Fixer introduces a coarse-to-fine scheme that accurately determines the completion boundary and generates high-quality completed 3D assets from the fragmented single-view estimated geometry. In addition, we design a dual-branch conditioning network that integrates 2D and 3D contextual information to guide the pre-trained object generation priors for in-place completion. Furthermore, we introduce an Occlusion-Robust Feature Alignment strategy, which employs feature distillation to stabilize the training of the generative priors under occlusion. Existing scene-level datasets either suffer from limited scale or lack accurate per-instance ground truth, severely restricting the development of scene generation approaches. We therefore construct a large-scale scene-level dataset featuring over 110K diverse scenes and 3M images with complete 3D asset ground truth and accurate placement annotations. Experiments demonstrate that 3D-Fixer achieves state-of-the-art geometric accuracy while maintaining an inference speed comparable to feed-forward estimation methods, vastly outperforming iterative optimization approaches. Our dataset and trained models will be made publicly available upon acceptance.