Long-Tail Internet Photo Reconstruction
Abstract
Internet photo collections exhibit an extremely long-tailed distribution: a few famous landmarks are densely photographed and easily reconstructed, while most real-world sites contain only sparse, noisy, and uneven imagery that defeats classical and learned 3D methods. Existing 3D foundation models generalize well to curated datasets but collapse under the sparsity, ambiguity, and irregularity of Internet photos. We believe that tackling this long-tail regime represents one of the next frontiers for 3D foundation models. Although reliable supervision for sparse scenes is challenging to acquire, we observe that it can be effectively simulated by sampling sparse subsets from well-reconstructed Internet landmarks. To this end, we introduce MegaDepth-X, a large-scale, clean, and depth-refined dataset, together with a sparse-aware sampling strategy that mimics camera distributions in long-tail scenes. Finetuning 3D foundation models with these components yields robust reconstructions under extreme sparsity, demonstrating emergent symmetry disambiguation while preserving generalization to standard 3D benchmarks.
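To make the core idea concrete, the sketch below illustrates one plausible form of sparse-aware sampling under our reading of the abstract: cameras from a densely reconstructed landmark are clustered by position, and a small subset is drawn with heavily skewed cluster weights so that coverage is uneven, as in long-tail collections. This is an illustrative sketch, not the paper's released code; the function name, parameters, and the Dirichlet-based skew are all our assumptions.

```python
# Illustrative sketch only; not the paper's implementation. All names
# (sample_sparse_subset, cluster_bias, n_views) are hypothetical.
import numpy as np
from sklearn.cluster import KMeans

def sample_sparse_subset(positions, n_views=5, n_clusters=8, cluster_bias=2.0, seed=0):
    """Draw a sparse, spatially uneven subset of cameras from a dense scene.

    positions: (N, 3) array of camera centers from a well-reconstructed landmark.
    cluster_bias > 1 skews sampling toward a few viewpoint clusters, mimicking
    the uneven coverage typical of long-tail Internet photo collections.
    """
    rng = np.random.default_rng(seed)
    # Group cameras into viewpoint clusters.
    labels = KMeans(n_clusters=n_clusters, n_init="auto",
                    random_state=seed).fit_predict(positions)
    # Skewed cluster weights: small Dirichlet concentration puts most mass
    # on a few clusters, like a handful of popular viewpoints.
    weights = rng.dirichlet(np.full(n_clusters, 1.0 / cluster_bias))
    chosen = set()
    for _ in range(n_views):
        c = rng.choice(n_clusters, p=weights)
        members = np.flatnonzero(labels == c)
        chosen.add(int(rng.choice(members)))
    # Duplicates are deduplicated, so the subset may be slightly smaller
    # than n_views; acceptable for simulating extreme sparsity.
    return sorted(chosen)

# Example: 500 cameras around a landmark, reduced to a handful of uneven views.
cams = np.random.default_rng(1).normal(size=(500, 3))
print(sample_sparse_subset(cams, n_views=5))
```

Under this reading, training pairs for finetuning would be formed by pairing each sparse subset with the depth-refined geometry of the full reconstruction, so the model learns to recover dense structure from few, unevenly distributed views.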