SpatialTree: How Spatial Intelligence Branches Out in MLLMs
Abstract
Spatial Intelligence (SI) has emerged as a critical frontier for multimodal large language models (MLLMs), encompassing a hierarchy of skills from foundational perception to high-level spatial reasoning. However, how these abilities are acquired, how they emerge, and how they transfer remains largely unknown. To investigate this, we propose SpatialTree, a hierarchical taxonomy that organizes SI into a capability tree with four levels: low-level perception (L1), mental mapping (L2), mental simulation (L3), and agentic competence (L4). Building on this taxonomy, we construct a hierarchical, capability-centric benchmark using our proposed Spatial Engine, annotating each ability according to its level. Guided by the benchmark's correlation analysis, we conduct targeted supervised fine-tuning (SFT) and prompting experiments on key abilities. The results confirm the independence of abilities at the same level, reveal cross-level transfer, and further demonstrate multi-ability synergy when these abilities are trained jointly. Our work provides a novel framework for analyzing SI in MLLMs, offering a comprehensive methodology to study how foundational abilities emerge and support higher-level competencies.
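As an illustration only, the four-level ordering described above can be written down as a small data structure. The sketch below is not the benchmark's implementation: the names `SpatialLevel` and `lower_levels` are hypothetical, and only the level names and their L1 to L4 ordering are taken from the abstract.

```python
from enum import IntEnum


class SpatialLevel(IntEnum):
    """Hypothetical encoding of the four levels named in the abstract."""
    PERCEPTION = 1          # L1: low-level perception
    MENTAL_MAPPING = 2      # L2: mental mapping
    MENTAL_SIMULATION = 3   # L3: mental simulation
    AGENTIC_COMPETENCE = 4  # L4: agentic competence


def lower_levels(level: SpatialLevel) -> list:
    """Return the levels strictly below `level`, in ascending order.

    Assumption: lower levels are the candidate foundations whose
    transfer to `level` one might want to examine.
    """
    return [candidate for candidate in SpatialLevel if candidate < level]


if __name__ == "__main__":
    for level in SpatialLevel:
        names = [lower.name for lower in lower_levels(level)]
        print(f"L{level.value} {level.name}: builds on {names}")
```

Using an ordered enumeration makes "same level" and "cross-level" comparisons simple integer comparisons, which is all this sketch is meant to convey.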