TANGO: Text-Anchored Guided Optimization for Robust Fine-tuning of Vision-Language Models under Label Noise
Abstract
Fine-tuning large-scale Vision-Language Models (VLMs) is crucial for specialized tasks, but their performance is often undermined by the label noise prevalent in real-world datasets. Traditional approaches to learning with noisy labels typically rely on a self-referential loop, using a model's own predictions to correct errors. While recent VLM-specific methods have begun to leverage cross-modal information to aid noise detection, we explore an alternative direction: using the text modality not just to identify noise, but to establish a source of ground truth that is fully independent of the training data's potentially corrupt labels. To this end, we propose \textbf{T}ext-\textbf{AN}chored \textbf{G}uided \textbf{O}ptimization (TANGO), a framework centered on ``semantic anchors''---a set of pure, immutable reference points generated from diverse text descriptions. Building upon these anchors, our approach reframes two key aspects of learning with noisy labels. First, we replace the conventional linear classifier with a parameter-free Text-Anchored Classifier, making predictions a direct, weighted consensus of the clean anchors. Second, we introduce an Anchor-Guided Refinement mechanism that validates each sample's given label against this external ground truth, providing a more robust signal for sample selection and label correction. Extensive experiments demonstrate that this approach achieves competitive and often state-of-the-art performance. Our code will be publicly available.
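The parameter-free classifier described above can be sketched as a similarity-weighted consensus over a fixed bank of text-anchor embeddings. The sketch below is illustrative only: the function names, the temperature value, and the use of cosine similarity with a softmax over all anchors are assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a text-anchored classifier: predictions are a
# similarity-weighted consensus over frozen text "anchor" embeddings.
# Shapes, names, and the temperature are illustrative assumptions.
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def anchor_classify(image_emb, anchors, temperature=0.07):
    """image_emb: (d,) image feature.
    anchors: (C, K, d) -- K pre-encoded, frozen text anchors per class."""
    img = l2_normalize(image_emb)
    anc = l2_normalize(anchors)                  # (C, K, d)
    sims = anc @ img                             # (C, K) cosine similarities
    weights = np.exp(sims / temperature)
    weights /= weights.sum()                     # consensus weights over all anchors
    class_scores = (weights * sims).sum(axis=1)  # weighted consensus per class
    return int(class_scores.argmax()), class_scores

# Toy example: random anchors for 5 classes, an image near class 2's anchors.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(5, 4, 64))            # 5 classes, 4 anchors each
image_emb = anchors[2].mean(axis=0) + 0.01 * rng.normal(size=64)
pred, scores = anchor_classify(image_emb, anchors)
```

Because the classifier has no learned weights, noisy labels cannot corrupt it; only the quality of the text anchors matters, which is the property the framework exploits.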