VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale
Sven Elflein ⋅ Ruilong Li ⋅ Sérgio Agostinho ⋅ Žan Gojčič ⋅ Laura Leal-Taixe ⋅ Qunjie Zhou ⋅ Aljoša Ošep
Abstract
We present a scalable 3D reconstruction model that addresses a critical limitation of offline feed-forward methods: their compute and memory requirements grow quadratically with the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T$^3$ ($\mathbf{V}$isual $\mathbf{G}$eometry $\mathbf{G}$rounded $\mathbf{T}$est $\mathbf{T}$ime $\mathbf{T}$raining) scales linearly with the number of input views, similar to online models, and achieves an $11.6\times$ speed-up over baselines that rely on softmax attention, reconstructing a $1k$-image collection in just $54$ seconds. Because our method retains the ability to aggregate scene information globally, its point-map reconstruction error remains comparable to VGGT.
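To make the core idea concrete, the sketch below contrasts softmax attention over a growing KV cache with a fixed-size memory that is trained at test time. This is a minimal, hypothetical illustration of the general test-time-training principle the abstract describes, not the paper's actual architecture: the `TTTMemory` class, its linear parameterization, and the squared-error update rule are all illustrative assumptions.

```python
import numpy as np

def softmax_attention(q, K, V):
    """Baseline read: cost and memory grow with the number of cached (K, V) rows,
    which is the quadratic bottleneck for large image collections."""
    logits = q @ K.T
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V

class TTTMemory:
    """Fixed-size memory trained at test time (illustrative sketch).

    Instead of appending to a KV cache, each (key, value) pair is absorbed
    into fixed weights W by one gradient step on ||W k - v||^2, so memory
    stays constant no matter how many views are processed.
    """
    def __init__(self, d_k, d_v, lr=0.5):
        self.W = np.zeros((d_v, d_k))  # size independent of the view count
        self.lr = lr

    def write(self, k, v):
        # One SGD step on the reconstruction loss; nothing is appended.
        err = self.W @ k - v
        self.W -= self.lr * np.outer(err, k)

    def read(self, q):
        return self.W @ q
```

Under this toy model, repeatedly writing a unit-norm key with its value drives `read(k)` toward `v`, while `W` keeps the same shape regardless of how many pairs are written, mirroring the linear-scaling behavior claimed in the abstract.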