CASPA: Graph-Structured Concept Anchors for Modality-Agnostic Adaptation in Vision–Language Models
Abhiroop Chatterjee ⋅ Susmita Ghosh ⋅ Ashish Ghosh ⋅ Emmett Ientilucci
Abstract
Recent advances in vision–language models (VLMs) have revealed both the promise and the rigidity of large-scale pretraining. Despite their impressive zero-shot generalization, existing adaptation paradigms—whether prompt tuning, adapter injection, or fine-tuning—remain class-specific, modality-biased, and structure-agnostic, design choices that limit reasoning-level transfer across tasks. To address these limitations, we rethink adaptation as learning a shared conceptual structure rather than a per-class specialization. We propose $\textbf{CASPA}$ (Concept-Anchored Semantic Prompt Adapter), a dual-anchor semantic adapter that jointly learns shared text and image anchors serving as a bidirectional conceptual interface between modalities. Each class learns a soft association distribution over these anchors, producing compositional representations that enable parameter sharing and semantic reuse. To further align the visual and textual reasoning spaces, CASPA employs $\textbf{Semantic Cross-Consistency Regularization (S-XCR)}$, which enforces geometric and semantic agreement between text- and image-conditioned anchor mixtures. To the best of our knowledge, this is the first work to jointly model graph-structured semantic adaptation and cross-modal regularization for unified, reasoning-level vision–language alignment. CASPA is evaluated across four adaptation regimes: base-to-novel generalization, few-shot learning under data scarcity, cross-dataset transfer, and backbone-agnostic few-shot evaluation. Across eleven diverse visual recognition datasets, it matches or outperforms several state-of-the-art methods.
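For concreteness, the anchor-mixture mechanism and the S-XCR objective described above can be sketched in PyTorch. This is an illustrative sketch, not the paper's implementation: the anchor count, the temperature, the feature-conditioned association (used here in place of the paper's per-class parameterization), and the symmetric-KL-plus-L2 form of S-XCR are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


class CASPAAdapter(torch.nn.Module):
    """Sketch of dual concept anchors: shared text/image anchors act as a
    bidirectional conceptual interface, and features are expressed as soft
    mixtures over them (hypothetical simplification of CASPA)."""

    def __init__(self, num_anchors: int = 32, dim: int = 512, tau: float = 0.07):
        super().__init__()
        # Learnable shared anchors for each modality (assumed count/dim).
        self.text_anchors = torch.nn.Parameter(torch.randn(num_anchors, dim))
        self.image_anchors = torch.nn.Parameter(torch.randn(num_anchors, dim))
        self.tau = tau  # assumed softmax temperature

    def mixture(self, feats: torch.Tensor, anchors: torch.Tensor) -> torch.Tensor:
        # Soft association distribution: cosine similarity to the anchors,
        # sharpened by the temperature and normalized with a softmax.
        f = F.normalize(feats, dim=-1)
        a = F.normalize(anchors, dim=-1)
        return F.softmax(f @ a.t() / self.tau, dim=-1)  # (B, K)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor):
        # Text- and image-conditioned anchor mixtures for matched pairs.
        p_text = self.mixture(text_feats, self.text_anchors)
        p_image = self.mixture(image_feats, self.image_anchors)
        # Compositional representations: convex combinations of anchors.
        text_repr = p_text @ self.text_anchors    # (B, D)
        image_repr = p_image @ self.image_anchors  # (B, D)
        return p_text, p_image, text_repr, image_repr


def sxcr_loss(p_text, p_image, text_repr, image_repr, lam: float = 1.0):
    """Hypothetical S-XCR term: symmetric KL between the two anchor mixtures
    (semantic agreement) plus an L2 term between the mixed representations
    (geometric agreement). One plausible instantiation, not the paper's."""
    t = p_text.clamp_min(1e-8)
    v = p_image.clamp_min(1e-8)
    kl = 0.5 * (F.kl_div(t.log(), v, reduction="batchmean")
                + F.kl_div(v.log(), t, reduction="batchmean"))
    geo = F.mse_loss(F.normalize(text_repr, dim=-1),
                     F.normalize(image_repr, dim=-1))
    return kl + lam * geo


if __name__ == "__main__":
    # Toy usage with random stand-ins for frozen VLM features.
    adapter = CASPAAdapter(num_anchors=32, dim=512)
    text_feats = torch.randn(8, 512)   # e.g., frozen text-encoder features
    image_feats = torch.randn(8, 512)  # matched image-encoder features
    p_t, p_i, r_t, r_i = adapter(text_feats, image_feats)
    loss = sxcr_loss(p_t, p_i, r_t, r_i)
    loss.backward()
```

Because every class representation is a mixture over the same small anchor set, the trainable state scales with the number of anchors rather than the number of classes, which is what enables the parameter sharing and semantic reuse claimed in the abstract.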