Photo3D: Advancing Photorealistic 3D Generation through Structure‑Aligned Detail Enhancement
Abstract
Although recent 3D‑native generators have made great progress in synthesizing reliable geometry, they still fall short of achieving realistic appearance. A key obstacle is the lack of diverse, high‑quality real‑world 3D assets with rich surface details, since capturing such data is intrinsically difficult owing to the varying scales of scenes, the non‑rigid motion of objects, and the limited precision of scanners. We introduce Photo3D, a framework for advancing photorealistic 3D generation, driven by image data generated by the GPT‑4o‑Image model. Because the generated images can distort 3D structures due to their lack of multi‑view consistency, we design a structure‑aligned multi‑view synthesis pipeline and construct a detail‑enhanced multi‑view dataset paired with 3D geometry. Building on this dataset, we present a realistic detail enhancement scheme that leverages perceptual feature adaptation and semantic structure matching to enforce appearance consistency with the realistic detail priors while preserving structural consistency with the 3D‑native geometry. Although our scheme applies to different 3D‑native generators in general, we present dedicated training strategies to facilitate the optimization of both geometry‑texture coupled and decoupled 3D‑native generation paradigms. Experiments demonstrate that Photo3D generalizes well across diverse 3D‑native generation paradigms and achieves state‑of‑the‑art photorealistic 3D generation performance. Code, models, and datasets will be released.