STAR-R1: Multi-View Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs
Abstract
Multimodal Large Language Models (MLLMs) remain far from human-level performance in multi-view spatial reasoning, where models must establish object correspondences across views and infer coherent scene semantics. We analyze this limitation through the Transformation-Driven Visual Reasoning (TVR) task and find that Supervised Fine-Tuning (SFT) fails to capture cross-view consistency, whereas reinforcement learning (RL) fails to reliably identify key referential objects. To bridge this gap, we introduce multi-View Spatial TrAnsformation Reasoning (STAR-R1), a two-stage framework that combines process-supervised SFT with a referential-aware RL paradigm. STAR-R1 first learns structured spatial reasoning trajectories from high-quality CoTs and then uses fine-grained rewards on referential selection and answer correctness to encourage effective exploration and robust scene interpretation. Despite using only a small amount of high-quality training data, STAR-R1 surpasses state-of-the-art models trained on far more data on the multi-view spatial understanding benchmarks TVR, MMSI-Bench, MindCube-Bench, and SPAR-Bench. Our study reveals the overlooked potential of RL for multi-view spatial understanding and points toward more human-like spatial reasoning in MLLMs.