CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection
Abstract
Adapting CLIP to vertical domains is typically approached through novel fine-tuning strategies or by scaling up domain-specific datasets. We revisit this task from a data-centric perspective: \textit{Can effective data selection substitute for large-scale datasets in continual pre-training (CPT)?} We introduce \textbf{CHIPS} (\textbf{C}urvature-aware \textbf{H}ybrid \textbf{I}nfluence in \textbf{P}rojection \textbf{S}ubspace), which assigns each image–text pair a utility score that integrates three complementary factors, each aligned with a distinct goal: \textit{faithfulness}, via a curvature-aware, Newton-style alignment computed in CLIP's end-point subspace; \textit{scalability}, via an InfoNCE-aware curvature estimator with Johnson–Lindenstrauss (JL) sketching; and \textit{retention}, via a selection-aware relevance weight combined with learnability to balance target-domain adaptation against general-domain preservation. We justify this design theoretically by proving a lower-bound guarantee on the proxy's correlation with full-parameter alignment and by characterizing the bias–variance trade-offs introduced by curvature mixing and JL sketching. We evaluate CHIPS empirically across a range of settings: 1) on 17 \textbf{medical benchmarks}, CHIPS attains state-of-the-art performance among selection baselines, matches full-dataset CPT with 30\% of the data, and outperforms half-dataset CPT using only 10\%; 2) on 31 \textbf{general-domain benchmarks}, CHIPS yields the smallest performance drop under 10--30\% data-retention budgets. Code, data, and model checkpoints will be released.
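For intuition about the pipeline the abstract outlines, the sketch below scores candidate pairs by a curvature-preconditioned alignment computed in a JL-sketched subspace, then mixes in relevance and learnability weights under a retention budget. It is a minimal illustration only: the helper names (`jl_projection`, `newton_alignment`, `chips_utility`), the diagonal curvature surrogate, and the additive mixing rule are assumptions made for exposition, not the paper's actual estimator or weighting.

```python
import numpy as np

rng = np.random.default_rng(0)

def jl_projection(d: int, k: int, rng) -> np.ndarray:
    """Gaussian JL map R^d -> R^k; inner products are preserved up to
    (1 +/- eps) distortion w.h.p. when k = O(log n / eps^2)."""
    return rng.standard_normal((d, k)) / np.sqrt(k)

def newton_alignment(g_sk: np.ndarray, h_diag: np.ndarray,
                     g_ref_sk: np.ndarray) -> np.ndarray:
    """Curvature-preconditioned (Newton-style) alignment g^T H^{-1} g_ref,
    using a diagonal curvature surrogate h_diag in the sketched subspace."""
    return (g_sk / (h_diag + 1e-8)) @ g_ref_sk

def chips_utility(align, relevance, learnability, lam=0.5):
    """Illustrative mixing of the three factors named in the abstract;
    the paper's exact weighting is defined in its method section."""
    return align + lam * relevance * learnability

# Toy usage: score n candidate image-text pairs from per-example
# end-point gradient features against a target-domain reference gradient.
n, d, k = 1000, 4096, 256
P = jl_projection(d, k, rng)
g = rng.standard_normal((n, d))            # per-example gradient features
g_ref = rng.standard_normal(d)             # target-domain reference gradient
h_diag = np.abs(rng.standard_normal(k)) + 1.0   # stand-in curvature estimate

align = newton_alignment(g @ P, h_diag, g_ref @ P)
relevance = rng.uniform(size=n)            # stand-in relevance weights
learnability = rng.uniform(size=n)         # stand-in learnability scores
scores = chips_utility(align, relevance, learnability)
top_idx = np.argsort(scores)[::-1][: int(0.3 * n)]  # keep a 30% budget
```

Note the single shared projection `P` applied to both the per-example gradients and the reference gradient: sketching them with independent projections would destroy the inner products the JL guarantee is meant to preserve.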