VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
Abstract
The rapid advancement of Large Multimodal Models (LMMs) for 2D images and videos has motivated extending these models to 3D scenes, aiming for human-like visual-spatial intelligence. However, achieving spatial understanding comparable to human capabilities remains challenging in both model design and data acquisition. Existing methods often rely on external depth sensors to capture geometry or on off-the-shelf algorithms to pre-construct 3D maps, which limits their scalability. In this work, we introduce VLM-3R, a framework for Vision-Language Models that couples 3D reconstructive instruction tuning with scalable training-data curation and a new benchmark for temporal reasoning. Specifically, VLM-3R processes monocular video frames with a geometry encoder that derives implicit 3D tokens representing scene context (spatial tokens) and camera motion (view tokens). In parallel, we build a scalable data-curation pipeline that yields over 200K 3D reconstructive instruction-tuning question-answer pairs. To evaluate temporal reasoning, we further introduce the Vision-Spatial-Temporal Intelligence benchmark (VSTI-Bench), comprising over 138.6K question-answer pairs across five distinct tasks focused on evolving spatial relationships. Extensive experiments show that VLM-3R supports robust visual-spatial reasoning and improves understanding of temporal changes in 3D context, enabling monocular 3D spatial assistance and embodied reasoning.