SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence
Abstract
Existing studies on spatial understanding in multimodal large language models (MLLMs) are typically limited by fragmented assessments. This work presents a comprehensive evaluation of the spatial understanding abilities of existing MLLMs. Concretely, we make the following contributions in this paper: (i) we propose SpatialScore, the most comprehensive and diverse multimodal spatial intelligence benchmark to date, encompassing various visual data types, input modalities, and QA formats, with around 5K manually verified samples across 30 distinct tasks; (ii) we construct SpatialCorpus, a large-scale training resource with 331K multimodal QA samples for supervised fine-tuning of Qwen3-VL on spatial understanding; (iii) we develop SpatialAgent, a multi-agent system incorporating 12 specialized spatial perception tools and supporting both Plan-Execute and ReAct reasoning paradigms, which improves spatial reasoning in a training-free manner; and (iv) we conduct extensive evaluations of 40 representative MLLMs, revealing persistent challenges in spatial intelligence while demonstrating the effectiveness of our data-driven and agent-based solutions. All data, code, and models will be publicly available.