Think-Then-Generate: Structural Chain-of-Thought Reasoning for Consistent 3D Generation
Abstract
Generating 3D assets using visual priors from pretrained 2D diffusion models has recently shown remarkable results. However, because 2D diffusion models inherently lack 3D geometric priors, the synthesized results often suffer from spatial hallucination and multi-view inconsistency. To address this limitation, we propose Thoughtful3D, a novel framework that improves 3D content generation by introducing structural chain-of-thought (CoT) reasoning to reduce cross-view inconsistency and mitigate hallucinations. Specifically, we design a dual-phase structural CoT strategy: (1) 3DBlueprint-CoT explicitly plans the 3D generation process through textual semantic parsing and logical deduction during the initialization phase; (2) 3DRefine-CoT dynamically evaluates latent inconsistencies by analyzing multiple renderings and employs a multi-round iterative refinement mechanism to suppress hallucinations and enhance cross-view consistency. To further promote consistency across views, we propose a Cross-view Semantic Appearance Alignment strategy that establishes dynamic geometric associations between corresponding features observed from different viewpoints. Extensive experiments demonstrate that Thoughtful3D significantly improves the quality and consistency of generated 3D assets.
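The abstract does not say how the cross-view associations are computed. As a minimal illustrative sketch, and not the authors' implementation, one plausible reading is nearest-neighbor matching of per-view feature vectors under cosine similarity, with the mean dissimilarity of the matched pairs serving as an alignment penalty. The function names (`cosine`, `match_features`, `alignment_loss`) and the list-of-vectors feature representation are assumptions made here for illustration only.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_features(view_a: List[Vector], view_b: List[Vector]) -> List[int]:
    """For each feature in view_a, return the index of its most similar feature in view_b.

    Hypothetical stand-in for the 'dynamic geometric associations' in the abstract.
    """
    if not view_b:
        raise ValueError("view_b must contain at least one feature")
    return [
        max(range(len(view_b)), key=lambda j: cosine(feat, view_b[j]))
        for feat in view_a
    ]


def alignment_loss(view_a: List[Vector], view_b: List[Vector]) -> float:
    """Mean (1 - cosine similarity) over matched feature pairs; lower means more consistent."""
    if not view_a:
        raise ValueError("view_a must contain at least one feature")
    matches = match_features(view_a, view_b)
    total = sum(1.0 - cosine(feat, view_b[j]) for feat, j in zip(view_a, matches))
    return total / len(view_a)


# Toy features for two viewpoints: same directions, different order and scale.
front_view = [[1.0, 0.0], [0.0, 1.0]]
side_view = [[0.0, 2.0], [3.0, 0.0]]

matches = match_features(front_view, side_view)
loss = alignment_loss(front_view, side_view)
```

In this toy example each feature in `front_view` is paired with the feature in `side_view` that points in the same direction, so the penalty is zero. In a real pipeline the features would presumably come from rendered views of the 3D asset, and the penalty would be one term among several in the optimization objective.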