Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness
Abstract
Vision Transformers (ViTs) often fail under distribution shifts because they learn spurious correlations, such as background cues, rather than semantically meaningful features. Existing regularization methods, which typically rely on simple foreground/background masks, overlook the fine-grained semantic concepts that truly define an object (e.g., "long beak" and "wings" for a "bird"). To address this, we introduce a novel fine-tuning framework that steers model reasoning toward concept-level semantics. Our approach optimizes the model's internal relevance maps (computed via AttnLRP) to align with spatially grounded concept masks. These guidance masks are generated automatically and without manual annotation: class-relevant concepts are first proposed by an LLM-driven, label-free method and then segmented with a Vision-Language Model (GroundingSAM). The fine-tuning objective aligns relevance with these concept regions while simultaneously suppressing focus on spurious background areas and preserving classifier confidence via a dedicated loss term. This process requires only a minimal set of images and uses only half of the dataset's classes. Extensive experiments on five out-of-distribution (OOD) benchmarks show that our method significantly enhances robustness across multiple ViT-based models and an additional CNN model. Furthermore, we validate that the resulting relevance maps exhibit improved alignment with semantic object parts, providing a scalable path toward more robust and interpretable vision models. Finally, we confirm that concept-guided masks provide more effective guidance for model robustness than conventional segmentation maps, validating our hypothesis.
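The objective summarized above can be read as a weighted combination of a concept-alignment term, a background-suppression term, and a confidence-preservation term. The following is an illustrative sketch only; the exact form of each term, the KL-based confidence penalty, and the weights $\lambda_{\text{align}}, \lambda_{\text{bg}}, \lambda_{\text{conf}}$ are assumptions rather than the paper's definitive formulation:

$$
\mathcal{L}(\theta) \;=\; -\,\lambda_{\text{align}} \sum_{p \in \Omega} R_\theta(p)\, M_{\text{concept}}(p)
\;+\; \lambda_{\text{bg}} \sum_{p \in \Omega} R_\theta(p)\, M_{\text{bg}}(p)
\;+\; \lambda_{\text{conf}}\, \mathrm{KL}\!\bigl(f_{\theta_0}(x)\,\|\,f_\theta(x)\bigr),
$$

where $R_\theta(p)$ denotes the AttnLRP relevance at pixel $p$, $M_{\text{concept}}$ is the union of the automatically generated concept masks, $M_{\text{bg}}$ marks spurious background regions, and $f_{\theta_0}$, $f_\theta$ are the original and fine-tuned classifiers, respectively.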