Skip to yearly menu bar Skip to main content


SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Boyuan Chen · Zhuo Xu · Sean Kirmani · brian ichter · Dorsa Sadigh · Leonidas Guibas · Fei Xia

Arch 4A-E Poster #464
[ ] [ Project Page ]
Thu 20 Jun 10:30 a.m. PDT — noon PDT


Understanding and reasoning about spatial relationships is crucial for Visual Question Answering (VQA) and robotics. Vision Language Models (VLMs) have shown impressive performance in some VQA benchmarks but struggle with 3D spatial reasoning, such as recognizing distances or size differences between physical objects. This limitation may stem from a lack of 3D spatial knowledge in their training data. To address this, we propose training VLMs with extensive spatial reasoning data from the internet. Our approach includes developing an automatic 3D spatial VQA data generation framework, capable of creating 2 billion VQA examples from 10 million real-world images. We explore various factors in the training process, such as data quality, training pipeline, and VLM architecture. Our work introduces the first Internet-scale 3D spatial reasoning dataset in metric space. By co-training a VLM with this dataset, we significantly improve its performance in both qualitative and quantitative spatial VQA. Additionally, this enhanced VLM enables new applications in chain-of-thought spatial reasoning and robotics, particularly in quantitative estimation.

Live content is unavailable. Log in and register to view live content