VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models
Abstract
Large reasoning models have achieved strong performance on complex reasoning tasks by scaling test-time compute through extended inference-time thinking. However, recent studies observe that on vision-dependent tasks, extended textual reasoning at inference time often degrades performance: models progressively lose attention to visual tokens and increasingly rely on textual priors alone. To address this, prior works use reinforcement learning (RL)-based fine-tuning to route visual tokens or to employ refocusing mechanisms during reasoning. While effective, these methods are computationally expensive, requiring large-scale data generation and policy optimization. To leverage the benefits of inference-time compute without additional RL fine-tuning, we propose VisRef, a visually grounded test-time scaling framework. Our key idea is to actively guide the reasoning process by re-injecting a coreset of visual tokens that are semantically relevant to the current reasoning context yet diverse and globally representative of the image, yielding more grounded multi-modal reasoning. Experiments on three visual reasoning benchmarks with state-of-the-art multi-modal large reasoning models demonstrate that, under fixed inference-time compute budgets, VisRef consistently outperforms existing test-time scaling approaches by up to 6.4%.
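To make the core idea concrete, the sketch below illustrates one plausible way to select a coreset of visual tokens that balances relevance to the reasoning context with diversity over the image, in the spirit of greedy maximal-marginal-relevance selection. This is an illustrative assumption rather than the paper's actual algorithm: the function name `select_visual_coreset`, the trade-off weight `lam`, and the use of cosine similarity are all hypothetical choices not specified in the abstract.

```python
import numpy as np

def select_visual_coreset(visual_tokens, context_emb, k, lam=0.5):
    """Greedy relevance-diversity selection of k visual tokens.

    visual_tokens: (N, d) array of visual token embeddings.
    context_emb:   (d,) embedding of the current reasoning context.
    lam:           weight on relevance to the context; (1 - lam) weights
                   diversity / coverage of the image.
    Returns the indices of the selected coreset.
    """
    # Normalize so dot products behave like cosine similarities.
    V = visual_tokens / (np.linalg.norm(visual_tokens, axis=1, keepdims=True) + 1e-8)
    c = context_emb / (np.linalg.norm(context_emb) + 1e-8)

    relevance = V @ c  # (N,) similarity of each visual token to the reasoning context
    selected = []
    for _ in range(min(k, len(V))):
        if not selected:
            # No redundancy penalty for the first pick: take the most relevant token.
            redundancy = np.zeros(len(V))
        else:
            # Penalize tokens that are near-duplicates of already selected ones.
            redundancy = (V @ V[selected].T).max(axis=1)
        score = lam * relevance - (1 - lam) * redundancy
        score[selected] = -np.inf  # never pick the same token twice
        selected.append(int(score.argmax()))
    return selected

# Toy usage: 196 patch tokens of dimension 64, pick a coreset of 16.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 64))
context = rng.standard_normal(64)
print(select_visual_coreset(tokens, context, k=16))
```

In such a scheme, the selected token embeddings would be re-injected into the model's context during reasoning; how and when the re-injection happens is part of the method proper and is not captured by this sketch.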