ViLearn: Accelerating Training Convergence of Image-to-3D Generation via Visibility Learning
Rui Chen ⋅ Jianfeng Zhang ⋅ Jing Lin ⋅ Xuanyu Yi ⋅ Yixun Liang ⋅ Guan Luo ⋅ Xiu Li ⋅ Zeming Li ⋅ Ping Tan
Abstract
Single-image-to-3D shape generation has seen remarkable progress, driven by latent diffusion models trained on the compressed latent space of 3D VAEs. However, the task remains intrinsically ill-posed: recovering complete 3D geometry, especially occluded surfaces, from a single view is inherently ambiguous. Existing VecSet-based approaches further exacerbate this challenge by treating shape tokens as an unordered set without explicit positional encoding. This design forces diffusion models to simultaneously learn visible correspondences from the input image and hallucinate invisible geometry within a large, permutation-invariant token space, where the lack of structure significantly hinders training efficiency and convergence stability. To address this, we propose \textit{Visibility Learning}, a training paradigm that injects visibility structure and positional inductive bias into the image-to-3D pipeline. Our method comprises two synergistic components: (1) \textit{Visibility Grouping} (VG), which explicitly partitions VecSet tokens into visible and invisible subsets by exploiting the spatial locality of VecSet VAE decoders; and (2) \textit{Visibility-Aware Positional Encoding} (VAPE), which assigns shared positional embeddings to image tokens and visible shape tokens to amplify their correspondence, while using distinct encodings for invisible tokens to guide hallucination. By explicitly disentangling visible reconstruction from invisible hallucination, our approach shrinks the effective hypothesis space and provides clear structural guidance for diffusion models. Extensive experiments demonstrate that \textit{Visibility Learning} accelerates training convergence by up to \textcolor{red}{4.4$\times$} while achieving superior generation quality compared to strong VecSet-based baselines.
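To make the VAPE idea from the abstract concrete, the following is a minimal PyTorch sketch of how shared versus distinct positional embeddings could be assigned; the module name, tensor shapes, and the visible-token-to-image-position mapping are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of Visibility-Aware Positional Encoding (VAPE), assuming a
# DiT-style image-to-3D setup where shape tokens have already been grouped
# into visible and invisible subsets (Visibility Grouping). All names and
# shapes below are hypothetical.
import torch
import torch.nn as nn


class VisibilityAwarePE(nn.Module):
    """Shares one positional table between image tokens and visible shape
    tokens, and uses a separate table for invisible shape tokens."""

    def __init__(self, num_positions: int, num_invisible: int, dim: int):
        super().__init__()
        # Shared table: the same embedding indexes an image token and its
        # corresponding visible shape token, amplifying their correspondence.
        self.shared_pe = nn.Embedding(num_positions, dim)
        # Distinct table for invisible shape tokens, guiding hallucination.
        self.invisible_pe = nn.Embedding(num_invisible, dim)

    def forward(self, img_tokens, vis_tokens, invis_tokens, vis_pos):
        # img_tokens:   (B, N_img, D) image condition tokens
        # vis_tokens:   (B, N_vis, D) shape tokens grouped as visible
        # invis_tokens: (B, N_inv, D) shape tokens grouped as invisible
        # vis_pos:      (B, N_vis)    image-position index for each visible token
        B, n_img, _ = img_tokens.shape
        img_pos = torch.arange(n_img, device=img_tokens.device).expand(B, -1)

        img_tokens = img_tokens + self.shared_pe(img_pos)
        vis_tokens = vis_tokens + self.shared_pe(vis_pos)

        inv_pos = torch.arange(invis_tokens.shape[1],
                               device=invis_tokens.device).expand(B, -1)
        invis_tokens = invis_tokens + self.invisible_pe(inv_pos)

        # Recombine shape tokens into one sequence for the diffusion model.
        return img_tokens, torch.cat([vis_tokens, invis_tokens], dim=1)
```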