CIGMA: Causal Information-Gain Mechanistic Attribution of Attention Heads in Vision Transformers
Abstract
Vision Transformers often rely on spurious background correlations rather than foreground object features. Prior model-pruning approaches focus solely on improving accuracy: they lack interpretability, cannot verify whether predictions are actually driven by the main foreground object, and provide no causal validation of which components produce spurious behavior. We introduce CIGMA (Causal Information-Gain Mechanistic Attribution), a general framework for explaining the internal computation of Vision Transformers. CIGMA provides a mechanistic, information-theoretic explanation by quantifying the importance of each attention head and determining whether it supports the main object or routes spurious background cues. Specifically, it ranks attention heads by their object-versus-context reliance, using a Jensen-Shannon-based information gain computed from the model's full predictive distributions under two complementary edits: removing the object region and removing the surrounding context. This ranking reveals a spurious subnetwork that carries background signals and a complementary set of evidence-aligned heads. Evaluated on CIFAR-10, CIFAR-100, and Tiny-ImageNet across three VLM architectures (InternVL2-26B, LLaVA-1.6, and LLaVA-1.5-13B), CIGMA improves accuracy by 7.6-24.8 percentage points over unmodified models while reducing background reliance by 79.5-88.1\%, substantially outperforming all baselines. These results demonstrate that causal head-level interventions mitigate spurious correlations more effectively than token pruning or retraining.
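To make the head-ranking signal concrete, the sketch below computes a Jensen-Shannon-based reliance score from three predictive distributions: the model's output on the unedited image, after removing the object region, and after removing the surrounding context. The abstract does not specify the exact functional form, so the difference-of-divergences score, the function names, and the toy per-head usage are illustrative assumptions, not CIGMA's actual definition.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions (in nats)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))  # KL(a || b)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def head_score(p_full, p_no_object, p_no_context):
    """Hypothetical object-vs-context reliance score for one attention head.

    p_full:       predictive distribution on the unedited image
    p_no_object:  predictive distribution after masking the object region
    p_no_context: predictive distribution after masking the surrounding context

    A large positive score means removing the object changes the prediction
    much more than removing the context (an evidence-aligned head); a negative
    score flags a candidate spurious head that relies on background context.
    """
    return js_divergence(p_full, p_no_object) - js_divergence(p_full, p_no_context)

# Toy usage: rank heads by reliance score (distributions here are random
# stand-ins for per-head predictive distributions obtained by intervention).
rng = np.random.default_rng(0)
heads = {f"L{layer}.H{head}": tuple(rng.dirichlet(np.ones(10)) for _ in range(3))
         for layer in range(2) for head in range(4)}
ranking = sorted(heads, key=lambda k: head_score(*heads[k]), reverse=True)
print(ranking)
```

Under this reading, thresholding the ranked scores would partition the heads into the spurious subnetwork and the evidence-aligned set described above; how CIGMA obtains per-head distributions and sets that threshold is left to the main text.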