Envision, Attend, Then Respond: Counterfactual Hallucination Mitigation in Large Vision-Language Models
Abstract
Large Vision-Language Models (LVLMs) often hallucinate when visual evidence conflicts with world knowledge, i.e., in counterfactual scenarios. We propose Envision-Attend-Respond (EnAR), a training-free framework that leverages visual priors to steer the model's attention toward counterfactual elements in the image. The Envision stage constructs a visual impression by applying latent perturbations guided by a diffusion prior, yielding a prior-consistent counterpart of the input image. The Attend stage passes the original image and its visual impression through the LVLM's vision encoder to localize counterfactual elements, forming a corresponding padded input. The Respond stage performs contrastive decoding between the original and padded inputs to suppress knowledge-driven bias and strengthen grounding in the visual evidence. Empirically, EnAR consistently mitigates hallucinations and improves response fidelity, achieving a 10.82\% gain on VLMBias and an average 6.9\% improvement on POPE, demonstrating robustness across both counterfactual and general hallucination settings. Moreover, the framework remains effective across heterogeneous LVLM architectures, offering a new perspective on hallucination governance in multimodal reasoning.
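To make the Respond stage concrete, the sketch below illustrates one way contrastive decoding between two inputs is commonly formulated: the logits from the original image are amplified while the logits from the padded, prior-consistent input are subtracted. This is a minimal illustration, not the paper's exact procedure; the function name, the weighting parameter `alpha`, and its default value are assumptions introduced here for exposition.

```python
import torch

def contrastive_decode_step(logits_original: torch.Tensor,
                            logits_padded: torch.Tensor,
                            alpha: float = 1.0) -> torch.Tensor:
    """One decoding step contrasting logits from two LVLM forward passes.

    `alpha` controls how strongly the padded-input distribution is
    subtracted; the name and default are illustrative assumptions.
    """
    # Up-weight tokens supported by the original image and down-weight
    # tokens that the prior-consistent (padded) input also predicts,
    # so that prior-driven predictions are suppressed.
    contrasted = (1.0 + alpha) * logits_original - alpha * logits_padded
    return torch.log_softmax(contrasted, dim=-1)

# Toy usage: random logits stand in for the two forward passes.
vocab_size = 32000
logits_orig = torch.randn(1, vocab_size)
logits_pad = torch.randn(1, vocab_size)
next_token = contrastive_decode_step(logits_orig, logits_pad).argmax(dim=-1)
```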