DVGT: Visual Geometry Transformer for Autonomous Driving
Abstract
Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, the field still lacks a driving-targeted dense geometry perception model that can adapt to diverse scenarios and camera configurations. To bridge this gap, we propose the Visual Geometry Transformer for autonomous Driving (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features from each image and then apply alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. Finally, multiple heads decode a global point map in the ego coordinate system of the first frame, together with the ego pose of each frame. DVGT directly predicts metric-scale geometry from image sequences, eliminating the need for post-hoc alignment with external sensors. Trained on a large mixture of driving datasets, including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing geometry prediction models across diverse scenarios.
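To make the alternating-attention scheme above concrete, below is a minimal PyTorch sketch of one such block operating on patch tokens shaped (frames, views, patches, channels). All module names, tensor shapes, and hyperparameters here (e.g., `AlternatingAttentionBlock`, `dim=256`, pre-norm residual attention) are illustrative assumptions, not details taken from DVGT.

```python
# Minimal sketch of alternating intra-view local, cross-view spatial,
# and cross-frame temporal attention. Shapes and names are assumptions.
import torch
import torch.nn as nn


class AlternatingAttentionBlock(nn.Module):
    """One block of the three alternating self-attention stages."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [T, V, P, C] = (frames, views, patches per view, channels)
        T, V, P, C = x.shape

        # 1) Intra-view local attention: tokens attend within one image.
        h = x.reshape(T * V, P, C)
        q = self.norms[0](h)
        h = h + self.local_attn(q, q, q)[0]

        # 2) Cross-view spatial attention: all views of a frame attend jointly.
        h = h.reshape(T, V * P, C)
        q = self.norms[1](h)
        h = h + self.spatial_attn(q, q, q)[0]

        # 3) Cross-frame temporal attention: each view/patch slot attends
        #    across time.
        h = h.reshape(T, V, P, C).permute(1, 2, 0, 3).reshape(V * P, T, C)
        q = self.norms[2](h)
        h = h + self.temporal_attn(q, q, q)[0]
        return h.reshape(V, P, T, C).permute(2, 0, 1, 3)


if __name__ == "__main__":
    # e.g., 4 frames, 6 surround-view cameras, 196 patch tokens, 256 channels.
    tokens = torch.randn(4, 6, 196, 256)
    out = AlternatingAttentionBlock()(tokens)
    print(out.shape)  # torch.Size([4, 6, 196, 256])
```

In this sketch, each stage reuses the same token set but reshapes it so that attention mixes information along one axis at a time (within an image, across cameras, then across frames); the global point map and per-frame ego-pose heads described in the abstract would consume the resulting features.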