Envision, Attend, Then Respond: Counterfactual Hallucination Mitigation in Large Vision-Language Models
Abstract
Large Vision-Language Models (LVLMs) often hallucinate when visual evidence conflicts with world knowledge, i.e., in counterfactual scenarios. We propose Envision-Attend-Respond (EnAR), a training-free framework that leverages visual priors to steer the model's attention toward counterfactual elements in the image. The Envision stage constructs a visual impression by applying latent perturbations guided by a diffusion prior, yielding a prior-consistent counterpart of the input image. The Attend stage passes the original image and its visual impression through the LVLM's vision encoder to localize counterfactual elements, forming a corresponding padded input. The Respond stage performs contrastive decoding between the original and padded inputs to suppress knowledge-driven bias and strengthen grounding in the visual evidence. Empirically, EnAR consistently mitigates hallucinations and improves response fidelity, achieving a 10.82\% gain on VLMBias and an average 6.9\% improvement on POPE, demonstrating robustness across both counterfactual and general hallucination settings. Moreover, the framework remains effective across heterogeneous LVLM architectures, offering a new perspective on hallucination governance in multimodal reasoning.
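To make the Respond stage concrete, the sketch below illustrates one way contrastive decoding between two inputs is commonly formulated: the logits from the original image are amplified while the logits from the padded, prior-consistent input are subtracted. This is a minimal illustration, not the paper's exact procedure; the function name, the weighting parameter `alpha`, and its default value are assumptions introduced here for exposition.

```python
import torch

def contrastive_decode_step(logits_original: torch.Tensor,
                            logits_padded: torch.Tensor,
                            alpha: float = 1.0) -> torch.Tensor:
    """One decoding step contrasting logits from two LVLM forward passes.

    `alpha` controls how strongly the padded-input distribution is
    subtracted; the name and default are illustrative assumptions.
    """
    # Up-weight tokens supported by the original image and down-weight
    # tokens that the prior-consistent (padded) input also predicts,
    # so that prior-driven predictions are suppressed.
    contrasted = (1.0 + alpha) * logits_original - alpha * logits_padded
    return torch.log_softmax(contrasted, dim=-1)

# Toy usage: random logits stand in for the two forward passes.
vocab_size = 32000
logits_orig = torch.randn(1, vocab_size)
logits_pad = torch.randn(1, vocab_size)
next_token = contrastive_decode_step(logits_orig, logits_pad).argmax(dim=-1)
```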