Draft and Refine with Visual Experts
Abstract
While recent Large Vision–Language Models (LVLMs) exhibit impressive multimodal reasoning abilities, they often produce ungrounded, hallucinated responses by over-relying on linguistic priors rather than visual evidence. This limitation is hard to address because we lack a quantitative measure of how much these models actually rely on visual inputs during reasoning. We propose Draft and Refine (DnR), an agent framework driven by a novel question-conditioned utilization metric. The metric quantifies the model’s actual reliance on visual evidence by first constructing a question-conditioned relevance map that localizes question-specific evidence, and then assessing dependence on that evidence through relevance-based probabilistic masking. Guided by this metric, the DnR agent refines its initial draft response using targeted feedback from external visual experts. Each expert’s output (e.g., boxes or masks) is rendered as a visual cue on the image, the LVLM is re-queried on each cued image, and the response that yields the greatest improvement in utilization is selected. This process strengthens the visual grounding of predictions without retraining or architectural changes. Experiments across a broad range of VQA and captioning benchmarks demonstrate consistent accuracy gains and reduced hallucination. These results show that quantifying visual utilization provides a principled path toward more interpretable, evidence-driven multimodal agent systems that effectively leverage visual experts.
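To make the procedure described above concrete, the following is a minimal, hedged sketch of the utilization metric and the Draft-and-Refine loop, not the paper's actual implementation. The LVLM, the relevance-map computation, and the visual experts are abstracted as callables, and all helper names (`score_answer`, `mask_by_relevance`, `render_cue`, etc.) are hypothetical placeholders introduced only for illustration.

```python
# Hedged sketch of DnR, assuming these hypothetical interfaces:
#   generate(image, question) -> str                 # LVLM free-form answer
#   score_answer(image, question, answer) -> float   # LVLM confidence in `answer`
#   relevance(image, question) -> np.ndarray         # question-conditioned map in [0, 1]
#   expert(image, question) -> np.ndarray            # binary cue mask (box / segment)
from typing import Callable, List
import numpy as np

def mask_by_relevance(image: np.ndarray, rel: np.ndarray, rate: float,
                      rng: np.random.Generator) -> np.ndarray:
    """Toy relevance-based probabilistic masking: drop pixels with probability
    proportional to their relevance to the question."""
    keep = rng.random(rel.shape) > rate * rel / max(rel.max(), 1e-8)
    return image * keep[..., None]

def utilization(score_answer: Callable, image: np.ndarray, question: str,
                answer: str, rel: np.ndarray, n_samples: int = 8,
                rate: float = 0.7, seed: int = 0) -> float:
    """Average confidence drop when question-relevant regions are masked out;
    a larger drop indicates stronger reliance on the localized visual evidence."""
    rng = np.random.default_rng(seed)
    base = score_answer(image, question, answer)
    drops = [base - score_answer(mask_by_relevance(image, rel, rate, rng),
                                 question, answer)
             for _ in range(n_samples)]
    return float(np.mean(drops))

def render_cue(image: np.ndarray, cue: np.ndarray) -> np.ndarray:
    """Overlay an expert's cue (e.g., box or mask) on the image as a red highlight."""
    out = image.astype(float).copy()
    out[cue > 0] = 0.5 * out[cue > 0] + 0.5 * np.array([255.0, 0.0, 0.0])
    return out

def draft_and_refine(generate: Callable, score_answer: Callable, relevance: Callable,
                     experts: List[Callable], image: np.ndarray, question: str) -> str:
    draft = generate(image, question)                      # 1) initial draft answer
    rel = relevance(image, question)                       # 2) question-conditioned relevance
    best, best_u = draft, utilization(score_answer, image, question, draft, rel)
    for expert in experts:                                 # 3) targeted expert feedback
        cued = render_cue(image, expert(image, question))  #    render cue onto the image
        cand = generate(cued, question)                    # 4) re-query the LVLM on the cued image
        u = utilization(score_answer, cued, question, cand, rel)
        if u > best_u:                                     # 5) keep the answer with the largest
            best, best_u = cand, u                         #    gain in visual utilization
    return best
```

Under these assumptions, the agent never updates model weights: it only re-queries the frozen LVLM on cue-augmented images and uses the utilization score to arbitrate among candidate responses.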