VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models
Abstract
Large reasoning models have achieved strong performance on complex reasoning tasks by scaling test-time compute through extended inference-time thinking. However, recent studies observe that on vision-dependent tasks, extended textual reasoning at inference time often degrades performance: models progressively lose attention to visual tokens and increasingly rely on textual priors alone. To address this, prior works use reinforcement learning (RL)-based fine-tuning to route visual tokens or to employ refocusing mechanisms during reasoning. While effective, these methods are computationally expensive, requiring large-scale data generation and policy optimization. To leverage the benefits of inference-time compute without additional RL fine-tuning, we propose VisRef, a visually grounded test-time scaling framework. Our key idea is to actively guide the reasoning process by re-injecting a coreset of visual tokens that are semantically relevant to the current reasoning context yet diverse and globally representative of the image, yielding more grounded multi-modal reasoning. Experiments on three visual reasoning benchmarks with state-of-the-art multi-modal large reasoning models demonstrate that, under fixed inference-time compute budgets, VisRef consistently outperforms existing test-time scaling approaches by up to 6.4%.
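To make the core idea concrete, the sketch below illustrates one plausible way to select a coreset of visual tokens that balances relevance to the reasoning context with diversity over the image, in the spirit of greedy maximal-marginal-relevance selection. This is an illustrative assumption rather than the paper's actual algorithm: the function name `select_visual_coreset`, the trade-off weight `lam`, and the use of cosine similarity are all hypothetical choices not specified in the abstract.

```python
import numpy as np

def select_visual_coreset(visual_tokens, context_emb, k, lam=0.5):
    """Greedy relevance-diversity selection of k visual tokens.

    visual_tokens: (N, d) array of visual token embeddings.
    context_emb:   (d,) embedding of the current reasoning context.
    lam:           weight on relevance to the context; (1 - lam) weights
                   diversity / coverage of the image.
    Returns the indices of the selected coreset.
    """
    # Normalize so dot products behave like cosine similarities.
    V = visual_tokens / (np.linalg.norm(visual_tokens, axis=1, keepdims=True) + 1e-8)
    c = context_emb / (np.linalg.norm(context_emb) + 1e-8)

    relevance = V @ c  # (N,) similarity of each visual token to the reasoning context
    selected = []
    for _ in range(min(k, len(V))):
        if not selected:
            # No redundancy penalty for the first pick: take the most relevant token.
            redundancy = np.zeros(len(V))
        else:
            # Penalize tokens that are near-duplicates of already selected ones.
            redundancy = (V @ V[selected].T).max(axis=1)
        score = lam * relevance - (1 - lam) * redundancy
        score[selected] = -np.inf  # never pick the same token twice
        selected.append(int(score.argmax()))
    return selected

# Toy usage: 196 patch tokens of dimension 64, pick a coreset of 16.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 64))
context = rng.standard_normal(64)
print(select_visual_coreset(tokens, context, k=16))
```

In such a scheme, the selected token embeddings would be re-injected into the model's context during reasoning; how and when the re-injection happens is part of the method proper and is not captured by this sketch.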