Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching
Abstract
Deploying multimodal large language models (MLLMs) in 3D physical environments demands complex spatial reasoning that integrates geometric understanding, viewpoint synthesis, fine-grained perception, and robust depth estimation. However, systematic frameworks for evaluating and training these capabilities in current MLLMs are lacking. We address this gap through the lens of wide-baseline matching (WBM): determining whether two views separated by large viewpoint changes, appearance shifts, and occlusions depict the same scene element. We introduce ReasonMatch-Bench, a comprehensive benchmark stratified by viewpoint displacement and matching granularity across indoor, outdoor, and object-centric scenarios. Our evaluation reveals substantial gaps between human performance and state-of-the-art MLLMs, especially smaller models, exposing critical deficiencies in spatial reasoning. To bridge this gap, we propose a scalable data generation pipeline that automatically extracts wide-baseline view pairs from large-scale video-3D corpora (RGB-D videos and SfM reconstructions), providing diverse, verifiable supervision. Using verifiable matching accuracy as a reward signal, we introduce Dynamic Correspondence Reinforcement Learning (DCRL), which combines Image-Level Viewpoint Progression with a Point-Level Correspondence Curriculum to enable progressive acquisition of sophisticated spatial reasoning without explicit supervision. Extensive experiments demonstrate that our approach substantially enhances MLLMs' spatial reasoning, narrowing the gap with human performance on complex 3D understanding tasks.