EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs
Abstract
Multimodal large language models (MLLMs) are increasingly applied to spatial cognition tasks, where they are expected to understand and interact with complex environments. Most concurrent works enhance spatial reasoning by introducing 3D priors or geometric supervision, which improves performance but incurs substantial data preparation and alignment costs. Purely 2D approaches, however, struggle with multi-frame spatial reasoning because they miss viewpoint transitions and overlook implicit objects that serve as spatial bridges between frames. To address these limitations, we propose EgoMind, a Chain-of-Thought framework that enables geometry-free spatial reasoning through Role-Play Captioning and Progressive Spatial Analysis, which jointly construct a coherent linguistic scene graph across frames. With only 5K auto-generated SFT samples and 20K RL samples, EgoMind achieves competitive results on VSI-Bench, SPAR-Bench, STI-Bench, and SPBench, demonstrating its effectiveness in strengthening the spatial reasoning capabilities of MLLMs and highlighting the potential of linguistic reasoning for spatial cognition.