VGGT-$\Omega$
Jianyuan Wang ⋅ Minghao Chen ⋅ Shangzhan Zhang ⋅ Nikita Karaev ⋅ Johannes Schönberger ⋅ Patrick Labatut ⋅ Piotr Bojanowski ⋅ David Novotny ⋅ Andrea Vedaldi ⋅ Christian Rupprecht
Abstract
We present VGGT-Ω, a feed-forward model for 3D reconstruction that substantially advances the state of the art in accuracy, efficiency, and capability for both static and dynamic scenes. Prior models such as VGGT have shown that feed-forward 3D reconstruction can already be competitive with traditional optimization-based methods. Here, we further demonstrate that the accuracy and robustness of these models scale predictably with model capacity and data size. To enable training 3D reconstruction models at an unprecedented scale, we introduce a high-quality data annotation pipeline that handles dynamic scenes, a self-supervised learning protocol, and architectural changes that greatly reduce memory requirements. We significantly simplify VGGT’s architecture by replacing multiple dense prediction heads with loss-driven multitask learning, removing unstable DPT blocks, and introducing more efficient global attention via scene tokens. These changes allow us to efficiently train VGGT-Ω with 20$\times$ more supervised data and 100$\times$ more unsupervised data than prior work, while requiring only 30% of VGGT’s memory and running 1.6$\times$ faster at inference. As a result, VGGT-Ω establishes a new state of the art for 3D reconstruction on both static and dynamic scenes across a wide range of benchmarks, e.g., improving the camera estimation accuracy by 77% on the Sintel dataset. Models and code will be publicly released.
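The abstract's mention of "more efficient global attention via scene tokens" suggests replacing dense all-pairs attention across frame tokens with attention routed through a small set of learned global tokens. The sketch below is only an illustration of that general idea, not the paper's actual architecture; the class name, the two-step read/write design, and parameters such as `num_scene_tokens` are assumptions introduced here for clarity.

```python
import torch
import torch.nn as nn

class SceneTokenAttention(nn.Module):
    """Hypothetical sketch: frame tokens exchange global information through a
    small set of learned scene tokens instead of dense all-pairs attention."""

    def __init__(self, dim: int, num_scene_tokens: int = 16, num_heads: int = 8):
        super().__init__()
        # Learned scene tokens shared across the batch (assumed design choice).
        self.scene_tokens = nn.Parameter(torch.randn(1, num_scene_tokens, dim) * 0.02)
        # Scene tokens first read from all frame tokens (cross-attention) ...
        self.read = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # ... then frame tokens read the aggregated scene state back.
        self.write = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (B, N_frames * P_patches, dim), flattened over frames.
        B = frame_tokens.shape[0]
        scene = self.scene_tokens.expand(B, -1, -1)
        scene, _ = self.read(scene, frame_tokens, frame_tokens)   # gather global context
        updated, _ = self.write(frame_tokens, scene, scene)       # broadcast it back
        return frame_tokens + updated                             # residual update


# Usage: attention cost scales with (num_tokens * num_scene_tokens)
# rather than num_tokens**2, which is where the memory savings would come from.
x = torch.randn(2, 4 * 196, 768)          # 2 scenes, 4 frames of 196 patch tokens each
y = SceneTokenAttention(dim=768)(x)
print(y.shape)                            # torch.Size([2, 784, 768])
```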