CausalLens: Sensitivity-Guided Multi-Head Causal Intervention for Hallucination Mitigation in Large Vision-Language Models
Abstract
Recent Large Vision-Language Models (LVLMs) have shown impressive capabilities in multimodal understanding and generation. Despite this progress, they remain prone to hallucination, producing outputs that conflict with the visual input because the model over-relies on textual priors. Existing inference-time mitigation approaches often depend on multi-pass or contrastive decoding, which increases latency and limits their applicability in real-time settings. To address this limitation, we propose CausalLens, a training-free, single-pass intervention that directly adjusts decoder hidden states to strengthen visual grounding. CausalLens decomposes each attention head's contribution into visual, textual, and system-prompt pathways, identifies visually reliable heads with a sensitivity measure, and selectively amplifies their mid-layer hidden-state contributions. A projection-aligned correction then stabilizes the adjusted states after multi-head fusion, ensuring that the enhanced visual information is preserved throughout decoding. Extensive experiments across multiple hallucination benchmarks and LVLM architectures show that CausalLens consistently improves visual fidelity with negligible computational overhead. The method requires no fine-tuning or architectural changes, making it well-suited to practical, latency-sensitive applications.
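To make the pipeline concrete, the sketch below implements a single decoder-layer intervention in PyTorch. It is a minimal illustration under stated assumptions, not the paper's exact formulation: the sensitivity definition (the norm of each head's visual-pathway component), the hyperparameters top_k and alpha, and the norm-matching form of the projection-aligned correction are all illustrative choices.

import torch

def causal_lens_step(
    attn: torch.Tensor,         # (H, T, T) attention probabilities for one decoder layer
    values: torch.Tensor,       # (H, T, D) per-head value vectors
    visual_mask: torch.Tensor,  # (T,) bool, True at visual-token positions
    w_o: torch.Tensor,          # (H * D, hidden) output projection of the layer
    top_k: int = 8,             # number of heads to intervene on (assumed)
    alpha: float = 1.5,         # amplification of the visual pathway (assumed)
) -> torch.Tensor:
    H, T, D = values.shape

    # Pathway decomposition: each head's output is an attention-weighted sum
    # of value vectors, so masking attention to non-visual source tokens
    # isolates the head's visual contribution exactly.
    head_out = attn @ values                                    # (H, T, D)
    visual_attn = attn * visual_mask.view(1, 1, T).float()
    visual_part = visual_attn @ values                          # (H, T, D)

    # Sensitivity measure (assumed form): how much each head's output would
    # change if its visual pathway were ablated; a larger change marks the
    # head as more visually reliable.
    sensitivity = visual_part.norm(dim=-1).mean(dim=-1)         # (H,)
    reliable = sensitivity.topk(top_k).indices

    # Mid-layer adjustment: amplify the visual component of the reliable
    # heads only, leaving the remaining heads untouched.
    adjusted = head_out.clone()
    adjusted[reliable] += (alpha - 1.0) * visual_part[reliable]

    # Multi-head fusion through the output projection.
    fused_orig = head_out.transpose(0, 1).reshape(T, H * D) @ w_o
    fused_adj = adjusted.transpose(0, 1).reshape(T, H * D) @ w_o

    # Projection-aligned correction (assumed norm-matching form): rescale the
    # adjusted state so its magnitude matches the original fusion, keeping
    # the enhanced visual direction while preventing magnitude drift in
    # later layers.
    scale = fused_orig.norm(dim=-1, keepdim=True) / \
        fused_adj.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    return fused_adj * scale

# Toy usage with random tensors (H=32 heads, T=16 tokens, D=64 head dim).
H, T, D, hidden = 32, 16, 64, 2048
attn = torch.softmax(torch.randn(H, T, T), dim=-1)
values = torch.randn(H, T, D)
visual_mask = torch.zeros(T, dtype=torch.bool)
visual_mask[:6] = True   # first six positions hold visual tokens (assumed)
w_o = torch.randn(H * D, hidden) / (H * D) ** 0.5
out = causal_lens_step(attn, values, visual_mask, w_o)
print(out.shape)         # torch.Size([16, 2048])

Because the intervention reuses attention weights and value vectors already computed during the forward pass, it adds only a few matrix products per layer, which is consistent with the single-pass, negligible-overhead claim above.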