DVGT: Visual Geometry Transformer for Autonomous Driving
Abstract
Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, the field still lacks a driving-targeted dense geometry perception model that can adapt to diverse scenarios and camera configurations. To bridge this gap, we propose the Visual Geometry Transformer for autonomous Driving (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features from each image and then apply alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. Finally, multiple heads decode a global point map in the ego coordinate system of the first frame, together with the ego pose of each frame. DVGT directly predicts metric-scale geometry from image sequences, eliminating the need for post-hoc alignment with external sensors. Trained on a large mixture of driving datasets, including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing geometry prediction models across diverse scenarios.
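To make the alternating-attention scheme above concrete, below is a minimal PyTorch sketch of one such block operating on patch tokens shaped (frames, views, patches, channels). All module names, tensor shapes, and hyperparameters here (e.g., `AlternatingAttentionBlock`, `dim=256`, pre-norm residual attention) are illustrative assumptions, not details taken from DVGT.

```python
# Minimal sketch of alternating intra-view local, cross-view spatial,
# and cross-frame temporal attention. Shapes and names are assumptions.
import torch
import torch.nn as nn


class AlternatingAttentionBlock(nn.Module):
    """One block of the three alternating self-attention stages."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [T, V, P, C] = (frames, views, patches per view, channels)
        T, V, P, C = x.shape

        # 1) Intra-view local attention: tokens attend within one image.
        h = x.reshape(T * V, P, C)
        q = self.norms[0](h)
        h = h + self.local_attn(q, q, q)[0]

        # 2) Cross-view spatial attention: all views of a frame attend jointly.
        h = h.reshape(T, V * P, C)
        q = self.norms[1](h)
        h = h + self.spatial_attn(q, q, q)[0]

        # 3) Cross-frame temporal attention: each view/patch slot attends
        #    across time.
        h = h.reshape(T, V, P, C).permute(1, 2, 0, 3).reshape(V * P, T, C)
        q = self.norms[2](h)
        h = h + self.temporal_attn(q, q, q)[0]
        return h.reshape(V, P, T, C).permute(2, 0, 1, 3)


if __name__ == "__main__":
    # e.g., 4 frames, 6 surround-view cameras, 196 patch tokens, 256 channels.
    tokens = torch.randn(4, 6, 196, 256)
    out = AlternatingAttentionBlock()(tokens)
    print(out.shape)  # torch.Size([4, 6, 196, 256])
```

In this sketch, each stage reuses the same token set but reshapes it so that attention mixes information along one axis at a time (within an image, across cameras, then across frames); the global point map and per-frame ego-pose heads described in the abstract would consume the resulting features.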