Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation
Abstract
Feed-forward architectures for visual geometry estimation have advanced significantly in recent years. Interestingly, per-frame approaches typically exhibit weaker multi-frame consistency than multi-frame algorithms but achieve higher per-frame accuracy. This observation motivates a systematic investigation, through rigorous ablation studies, of the critical factors driving model performance. The investigation yields three key insights: 1) scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) the commonly adopted confidence-aware and gradient-based losses may unintentionally hinder performance; 3) joint supervision through both per-sequence and per-frame alignment improves results, whereas local region alignment surprisingly degrades performance. Furthermore, we introduce two enhancements that integrate the advantages of optimization-based methods and high-resolution inputs: a consistency loss that enforces alignment among depth maps, camera parameters, and point maps, and an efficient architectural design that enables high-resolution geometry estimation. We integrate these contributions into CFG, a model that simultaneously produces accurate and consistent geometric representations from diverse input perspectives at high resolution. Comprehensive evaluation on multiple benchmarks for point cloud reconstruction, video depth estimation, and camera pose and intrinsic parameter estimation confirms CFG's superior performance, establishing it as a state-of-the-art solution for visual geometry tasks.
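The abstract does not specify the form of the consistency loss. As an illustration only, the idea of tying depth maps, camera parameters, and point maps together can be sketched as follows: unproject the predicted depth with the predicted intrinsics and pose, then penalize the distance to the predicted point map. The pinhole camera model, the camera-to-world pose convention, the L1 penalty, and the function names `unproject_depth` and `consistency_loss` are all assumptions made for this sketch, not details taken from the paper.

```python
import numpy as np


def unproject_depth(depth, K, cam_to_world):
    """Lift a depth map (H, W) to world-space points (H, W, 3).

    Assumes a pinhole camera with intrinsics K (3x3) and a 4x4
    camera-to-world pose; both conventions are assumptions of this sketch.
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns (x), v indexes rows (y).
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)
    # Back-project pixels to camera-space rays with z = 1.
    rays = pix @ np.linalg.inv(K).T
    # Scale each ray by its depth to get camera-space points.
    pts_cam = rays * depth[..., None]
    # Move the points into the world frame.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t


def consistency_loss(depth, K, cam_to_world, point_map):
    """Mean L1 gap between unprojected depth and a predicted point map.

    Hypothetical form: the paper's actual loss may differ.
    """
    unprojected = unproject_depth(depth, K, cam_to_world)
    return float(np.mean(np.abs(unprojected - point_map)))
```

Under these assumptions, the loss is zero when the point map equals the unprojected depth and grows as the three predictions disagree, so minimizing it pushes the depth, camera, and point-map outputs toward a single consistent geometry.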