Subspace Alignment for CLIP-based Continual Learning via Canonical Correlation Analysis
Abstract
Recent advances in CLIP-based continual learning have shown the potential of leveraging pre-trained vision–language models for sequential tasks. However, existing methods overlook a key problem we call Asymmetric Drift. In CLIP-based continual learning, the visual branch undergoes stronger adaptation because the visual distribution shifts significantly across tasks, whereas the text branch remains relatively stable due to the low variance of textual prompts. This imbalance increases the modality distance and degrades cross-modal alignment over time. To address this issue, we propose CCA-CL, a framework that accumulates visual–textual covariance statistics across tasks and solves a Canonical Correlation Analysis (CCA) problem to compute a shared subspace. In this subspace, the distance between visual and textual features is minimized, enabling better alignment without modifying CLIP parameters. This also makes our method naturally compatible with exemplar-free continual learning settings. To further capture nonlinear relationships that linear CCA struggles to model, we introduce Random Fourier Projection as an extension. Experimental results demonstrate that CCA-CL effectively mitigates the asymmetric drift problem and achieves state-of-the-art performance on several benchmarks. Our code will be made available.
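To make the core mechanism concrete, the sketch below illustrates one plausible reading of the abstract: cross-modal covariance statistics are accumulated task by task, a regularized linear CCA is solved from those statistics to obtain the shared subspace, and a Random Fourier mapping is applied beforehand for the nonlinear variant. All function and variable names (update_stats, cca_subspace, random_fourier_features, eps, k) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def update_stats(stats, X, Y):
    """Accumulate second-order statistics from one task (no exemplars stored).
    X: (n, d_v) visual features, Y: (n, d_t) text features.
    Assumes features are already centered or L2-normalized."""
    stats["Cxx"] += X.T @ X
    stats["Cyy"] += Y.T @ Y
    stats["Cxy"] += X.T @ Y
    stats["n"] += X.shape[0]
    return stats

def cca_subspace(stats, k, eps=1e-4):
    """Solve regularized linear CCA from the accumulated statistics.
    Returns projections Wx: (d_v, k) and Wy: (d_t, k) into a shared subspace."""
    n = stats["n"]
    Cxx = stats["Cxx"] / n + eps * np.eye(stats["Cxx"].shape[0])
    Cyy = stats["Cyy"] / n + eps * np.eye(stats["Cyy"].shape[0])
    Cxy = stats["Cxy"] / n

    def inv_sqrt(C):
        # Inverse square root of a symmetric PSD matrix via eigendecomposition.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(np.clip(w, eps, None))) @ V.T

    # Whiten each modality, then take the top-k singular directions
    # of the whitened cross-covariance (standard CCA solution).
    T = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    U, _, Vt = np.linalg.svd(T)
    Wx = inv_sqrt(Cxx) @ U[:, :k]
    Wy = inv_sqrt(Cyy) @ Vt[:k].T
    return Wx, Wy

def random_fourier_features(X, W, b):
    """Random Fourier projection (RBF-kernel approximation); apply to X and Y
    before update_stats to obtain the nonlinear extension."""
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)
```

In use, update_stats would be called once per task on frozen CLIP features, and cca_subspace recomputed after each task; the CLIP encoders themselves are never updated, consistent with the exemplar-free setting described above.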