Breaking the Regional Perception Bottleneck of Multimodal Large Language Models via External Reasoning Framework
Abstract
High-quality pixel-level responses remain a major bottleneck for multimodal large language models (MLLMs) in regional perception. Existing approaches generally attach regression decoders to MLLM features, achieving strong grounding performance but sacrificing the end-to-end design and increasing training costs. Researchers have applied parameter and data scaling to improve pure MLLMs' ability to generate pixel coordinates in natural language, yet the performance gains on grounding tasks remain markedly weaker than those on standard QA tasks. Our analysis shows that the primary bottleneck is that conventional scaling fails to effectively enhance the key reasoning stage required for pixel-level regional perception. To address this, we propose R-Ground, a reasoning framework for MLLM-based grounding built upon a multimodal Monte Carlo Tree Search algorithm. R-Ground leverages structured reasoning actions, multimodal feature alignment scoring, and regional feature weighted voting to apply scaling precisely at this reasoning stage. Extensive experiments demonstrate that R-Ground achieves effective reasoning scaling, enabling a 7B MLLM to match or even surpass a 72B model on grounding tasks. The code will be released upon acceptance.
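To make the abstract's mechanism concrete, the following is a minimal, hypothetical sketch of an MCTS-style reasoning loop with a weighted vote over candidate boxes. All names here (`REASONING_ACTIONS`, `score_alignment`, `weighted_vote`) are illustrative assumptions, not R-Ground's actual implementation: the real system would select among structured reasoning actions, score leaves via multimodal feature alignment, and aggregate candidate regions by score.

```python
import math
import random

# Hypothetical set of structured reasoning actions (illustrative only).
REASONING_ACTIONS = ["zoom_in", "describe_region", "refine_box", "verify"]


class Node:
    """One node in the reasoning search tree."""

    def __init__(self, action=None, parent=None):
        self.action = action
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

    def uct(self, c=1.4):
        # Standard UCT score: exploit mean value, explore rarely-visited nodes.
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )


def score_alignment(path, rng):
    # Placeholder for multimodal feature-alignment scoring. In the real
    # framework this would compare region features with the query; here
    # we return a random pseudo-score in [0, 1] to keep the sketch runnable.
    return rng.random()


def mcts(iterations=200, seed=0):
    rng = random.Random(seed)
    root = Node()
    for _ in range(iterations):
        node, path = root, []
        # Selection: descend by UCT while the node is fully expanded.
        while node.children and len(node.children) == len(REASONING_ACTIONS):
            node = max(node.children, key=Node.uct)
            path.append(node.action)
        # Expansion: try one untried reasoning action.
        tried = {c.action for c in node.children}
        untried = [a for a in REASONING_ACTIONS if a not in tried]
        if untried:
            child = Node(rng.choice(untried), parent=node)
            node.children.append(child)
            node = child
            path.append(node.action)
        # Evaluation + backpropagation of the alignment score.
        reward = score_alignment(path, rng)
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return root


def weighted_vote(candidates):
    # Assumed form of "regional feature weighted voting": average candidate
    # boxes (x1, y1, x2, y2) weighted by their scores.
    total = sum(w for _, w in candidates)
    return tuple(
        sum(box[i] * w for box, w in candidates) / total for i in range(4)
    )
```

A usage sketch: running `mcts(200)` builds the tree, after which the highest-scoring candidate boxes from the leaves would be fused via `weighted_vote`, e.g. `weighted_vote([((0, 0, 10, 10), 1.0), ((10, 10, 20, 20), 1.0)])` yields `(5.0, 5.0, 15.0, 15.0)`.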