VES-RFT: Rewarding Visual Evidence Sensitivity to Mitigate Hallucinations in Large Vision–Language Models
Xuege Hou ⋅ Wenshuo Li ⋅ Yali Li ⋅ Han Shu ⋅ Yuan Wang ⋅ Xinghao Chen ⋅ Shengjin Wang
Abstract
Vision–Language Models (VLMs) often over-rely on linguistic priors even when images are provided, leading to object hallucinations. We revisit object-wise hallucination from the perspective of how visual evidence shapes the model’s uncertainty. For each input, we measure decision uncertainty with and without the image, and define a Visual Evidence Sensitivity (VES) signal as the image-attributable change in entropy. Building on this signal, we introduce Visual Evidence Sensitivity Reinforcement Fine-Tuning (VES-RFT), a training-time reinforcement fine-tuning method that explicitly rewards reliance on correct visual evidence. We pair this continuous, annotation-free signal with a verifiable reward that enforces factual object correctness by automatically checking generated object mentions against the image, yielding a computable objective without human annotations. We optimize the dual objective using critic-free GRPO with KL regularization, requiring only parallel image and no-image passes during training while preserving single-pass inference. Across multiple VLM families and benchmarks, VES-RFT consistently suppresses hallucinations and improves robustness under ambiguity without degrading general language ability. Specifically, on LLaVA-7B, VES-RFT reduces CHAIR$_S$ and CHAIR$_I$ on MS-COCO by 12.8 and 1.8 points, respectively, and increases POPE accuracy by 4.92%. Extensive experiments indicate that turning uncertainty into a learnable reward, paired with verifiable correctness signals, provides a scalable mechanism for training-time hallucination mitigation and stronger visual grounding.
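The VES signal above is defined as the image-attributable change in decision uncertainty. A minimal sketch of that idea follows, assuming Shannon entropy over a next-token distribution and the sign convention that a positive value means the image reduced uncertainty; the function names and toy distributions are illustrative, not the paper's implementation.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def visual_evidence_sensitivity(probs_with_image, probs_without_image):
    """Image-attributable entropy change: H(no image) - H(with image).

    A large positive value indicates the image sharpened the model's
    decision; a value near zero suggests the output was driven by
    language priors rather than visual evidence.
    """
    return entropy(probs_without_image) - entropy(probs_with_image)


# Toy example: the image collapses a uniform prior onto one candidate token.
p_no_img = [0.25, 0.25, 0.25, 0.25]
p_img = [0.90, 0.05, 0.03, 0.02]
ves = visual_evidence_sensitivity(p_img, p_no_img)  # positive: image helped
```

In VES-RFT this kind of per-input quantity would serve as a continuous, annotation-free reward term, computed from the parallel image and no-image forward passes mentioned in the abstract.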