Text-Image Conditioned 3D Generation
Abstract
High-quality 3D assets are critical for VR/AR, industrial design, and entertainment, driving growing interest in generative models that can create 3D content from user-provided prompts. Most existing 3D generators, however, rely on a single conditioning modality: image-conditioned models deliver high visual fidelity by exploiting pixel-aligned cues but suffer from viewpoint bias when the input view is limited or ambiguous, whereas text-conditioned models benefit from broad semantic guidance yet lack low-level visual detail. This restricts how users can express their intent and raises a natural question: can the two modalities be combined to yield more flexible and faithful 3D generation? Our diagnostic study shows that even a simple late fusion of text- and image-conditioned predictions outperforms single-modality models, revealing strong cross-modal complementarity. Building on this finding, we formalize the task of Text–Image Conditioned 3D Generation, which requires joint reasoning over a visual exemplar and a textual specification during generation. To address this task, we introduce TIGON, a minimalist dual-branch baseline that maintains separate image- and text-conditioned backbones with lightweight cross-modal fusion. Extensive experiments demonstrate that text–image conditioning yields consistent gains over single-modality methods, suggesting complementary vision–language guidance as a promising direction for future 3D generation research.
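As a rough illustration of what a dual-branch design with lightweight cross-modal fusion could look like, the sketch below keeps separate image and text conditioning streams and exchanges information through a small cross-attention module before a shared 3D decoder. The module names, feature dimensions, and the cross-attention fusion mechanism are assumptions made for clarity; the abstract does not specify TIGON's actual implementation, and the simpler late-fusion diagnostic would instead combine the outputs of two independently conditioned generators.

```python
# Illustrative sketch only: backbone interfaces, dimensions, and the
# cross-attention fusion below are assumptions, not TIGON's actual design.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Lightweight fusion: each branch attends to the other branch's tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.img_to_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_txt = nn.LayerNorm(dim)

    def forward(self, img_tokens, txt_tokens):
        # Image tokens query text tokens and vice versa; residual connections
        # preserve each branch's original single-modality features.
        img_attn, _ = self.img_to_txt(img_tokens, txt_tokens, txt_tokens)
        txt_attn, _ = self.txt_to_img(txt_tokens, img_tokens, img_tokens)
        return (self.norm_img(img_tokens + img_attn),
                self.norm_txt(txt_tokens + txt_attn))


class DualBranchGenerator(nn.Module):
    """Two conditioning branches (image / text) feeding a shared 3D decoder."""

    def __init__(self, image_encoder, text_encoder, decoder, dim: int = 768):
        super().__init__()
        self.image_encoder = image_encoder  # e.g. a pretrained vision backbone
        self.text_encoder = text_encoder    # e.g. a pretrained language backbone
        self.fusion = CrossModalFusion(dim)
        self.decoder = decoder              # maps fused tokens to a 3D representation

    def forward(self, image, text_ids):
        img_tokens = self.image_encoder(image)    # (B, N_img, dim)
        txt_tokens = self.text_encoder(text_ids)  # (B, N_txt, dim)
        img_tokens, txt_tokens = self.fusion(img_tokens, txt_tokens)
        # Condition the 3D decoder on the concatenated fused token streams.
        return self.decoder(torch.cat([img_tokens, txt_tokens], dim=1))
```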