From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis
Abstract
In this paper, we introduce NAS3R, a self-supervised feed-forward framework that jointly learns explicit 3D geometry and camera parameters with no ground-truth annotations and no pretrained priors. Given uncalibrated and unposed multi-view images, NAS3R reconstructs 3D Gaussian primitives from context views and renders target views using its self-predicted camera parameters, enabling self-supervised training from 2D photometric supervision alone. To ensure stable convergence, NAS3R integrates scene reconstruction and camera estimation within a shared transformer backbone regulated by masked attention, and adopts a depth-based Gaussian formulation that yields well-conditioned optimization. The framework is compatible with state-of-the-art architectures and can incorporate pretrained priors or intrinsic information when available. Extensive experiments show that NAS3R outperforms other self-supervised methods, establishing a scalable and geometry-aware paradigm for 3D learning from unconstrained data.
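To make the self-supervised training signal concrete, the sketch below illustrates the loop implied by the abstract: predict Gaussian primitives and camera poses from context views, render a target view with the self-predicted cameras, and apply a 2D photometric loss. This is a minimal, hypothetical stand-in rather than the authors' implementation; the module names (ToyNAS3R, render_gaussians), the toy parameterizations, and the placeholder renderer are assumptions made only so the example is self-contained and runnable.

# Minimal sketch of the self-supervised photometric training loop described above.
# All names here are hypothetical placeholders, not the paper's code: the backbone
# is a dummy CNN and the "renderer" is a stand-in so the script runs end to end.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyNAS3R(nn.Module):
    """Predicts per-pixel Gaussian parameters and a per-view camera pose (toy stand-in)."""
    def __init__(self, feat=32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
        )
        # Per-pixel Gaussian parameters: depth (1), opacity (1), color (3) -> 5 channels (toy choice).
        self.gaussian_head = nn.Conv2d(feat, 5, 1)
        # Per-view camera: 3 translation + 3 rotation (axis-angle), toy parameterization.
        self.pose_head = nn.Linear(feat, 6)

    def forward(self, context_views):                              # (B, V, 3, H, W)
        B, V, C, H, W = context_views.shape
        feats = self.backbone(context_views.flatten(0, 1))         # (B*V, F, H, W)
        gaussians = self.gaussian_head(feats).view(B, V, 5, H, W)
        poses = self.pose_head(feats.mean(dim=(-2, -1))).view(B, V, 6)
        return gaussians, poses

def render_gaussians(gaussians, poses, target_pose):
    """Placeholder differentiable renderer: a real system would splat the Gaussians
    into the target view; here we only blend colors by opacity to keep the sketch runnable
    (the predicted depth channel is unused in this stand-in)."""
    color = gaussians[:, :, 2:5]                   # (B, V, 3, H, W)
    opacity = torch.sigmoid(gaussians[:, :, 1:2])  # (B, V, 1, H, W)
    weights = opacity / (opacity.sum(dim=1, keepdim=True) + 1e-6)
    return (weights * color).sum(dim=1)            # (B, 3, H, W)

def training_step(model, optimizer, context_views, target_view):
    """One self-supervised step: reconstruct from context views, render the target
    view with self-predicted cameras, and apply a 2D photometric loss."""
    gaussians, poses = model(context_views)
    target_pose = poses[:, 0]                      # toy choice; the target pose is also self-predicted
    rendered = render_gaussians(gaussians, poses, target_pose)
    loss = F.l1_loss(rendered, target_view)        # photometric supervision, no 3D ground truth
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = ToyNAS3R()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    context = torch.rand(2, 3, 3, 64, 64)          # B=2 scenes, V=3 context views
    target = torch.rand(2, 3, 64, 64)              # held-out target view (image only)
    print("photometric loss:", training_step(model, opt, context, target))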