Learning from Itself: Mining Internal Knowledge from Vision Language Models for Continual Learning
Abstract
Vision-language models such as CLIP excel at zero-shot recognition but struggle with continual learning due to two critical issues: (1) a severe distribution gap between pretraining captions and post-training class names, and (2) a performance mismatch between vision-only and dual-encoder approaches: vision-only methods achieve 20% higher accuracy on fine-grained tasks, while CLIP dominates on natural images. We propose Learning from Itself (LfI), which mines CLIP's internal knowledge to address both challenges. First, we generate pseudo-captions by optimizing learnable tokens to minimize CLIP's contrastive loss, creating auxiliary training signals that bridge the pretraining-finetuning distribution gap without external models. Second, we introduce adaptive mutual distillation, which dynamically weights knowledge transfer between CLIP's text encoder and a temporary vision classifier based on their instantaneous performance: stronger branches teach more, weaker ones learn more. At inference, only the original CLIP architecture is used, having absorbed discriminative knowledge from both branches. LfI achieves state-of-the-art results across multiple continual learning benchmarks, demonstrating that CLIP can effectively teach itself to continually learn new tasks.
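To make the "stronger branches teach more" idea concrete, the sketch below shows one plausible way to turn the two branches' instantaneous performance into mutual-distillation weights. The function name, the use of accuracy as the performance signal, and the softmax weighting form are illustrative assumptions, not the paper's exact formulation.

```python
import math

def adaptive_distill_weights(acc_text, acc_vision, temperature=1.0):
    """Hypothetical weighting rule for adaptive mutual distillation.

    Returns (w_text, w_vision): a softmax over the two branches'
    instantaneous accuracies, so the currently stronger branch
    contributes more teaching signal and the weaker one learns more.
    """
    logits = [acc_text / temperature, acc_vision / temperature]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return exps[0] / z, exps[1] / z

# Example: the temporary vision classifier is currently stronger
# (e.g. on a fine-grained task), so it carries more teaching weight.
w_text, w_vision = adaptive_distill_weights(acc_text=0.55, acc_vision=0.75)

# The combined distillation loss could then take a form such as
#   L = w_text * KL(p_vision || p_text) + w_vision * KL(p_text || p_vision),
# where each KL term pulls the student branch toward the teacher branch.
```

Under this sketch, the weights always sum to one, and the `temperature` parameter controls how sharply teaching responsibility shifts toward the stronger branch.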