Boosting Visual Reprogramming for CLIP with Dual Granularity Alignment
Abstract
Model reprogramming adapts pretrained models to downstream tasks by modifying their input and output spaces. Visual reprogramming (VR), a prominent instance, introduces learnable input transformations (e.g., visual prompts) to repurpose vision-language models such as CLIP for downstream visual tasks. Existing VR methods focus primarily on single-level alignment between prompted images and text descriptions, overlooking two forms of structural information inherent in the data that facilitate alignment: semantic granularity (label hierarchies) and visual granularity (multi-scale representations). To address this gap, we propose Dual Granularity Alignment (DGA). First, for visual granularity, we generate multi-scale images and propose Uncertainty-calibrated Prediction Fusion (UPF) to capture hierarchical spatial information within images. Second, for semantic granularity, we propose Prototype-guided Label Hierarchization (PLH) to construct category hierarchies from visual semantic similarities, and Hierarchical Knowledge Propagation (HKP) to provide top-down superclass-to-subclass guidance for coherent multi-level visual prompt alignment. DGA integrates both granularities collaboratively to strengthen alignment. Experiments on 12 downstream datasets demonstrate DGA's superiority over baselines with both ViT-based and ResNet-based CLIP architectures; in particular, DGA improves on the previous state-of-the-art method by 4.5% with ViT-B/16-based CLIP. By explicitly modeling structural granularities, DGA establishes a new paradigm for visual reprogramming.
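To make the fusion step concrete, the following is a minimal sketch of how uncertainty-calibrated fusion of multi-scale predictions could look. The abstract does not specify UPF's calibration rule, so this example assumes a simple entropy-based weighting (lower predictive entropy, i.e., higher confidence, receives a larger fusion weight); the function name and weighting scheme are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def uncertainty_calibrated_fusion(logits_per_scale):
    """Fuse per-scale class logits, down-weighting uncertain scales.

    ASSUMPTION: entropy-based weighting stands in for the paper's
    unspecified UPF calibration rule.

    logits_per_scale: list of tensors of shape (batch, num_classes),
    one entry per image scale.
    """
    probs = [F.softmax(logits, dim=-1) for logits in logits_per_scale]
    # Predictive entropy per scale: higher entropy -> less confident.
    entropies = torch.stack(
        [-(p * p.clamp_min(1e-8).log()).sum(dim=-1) for p in probs], dim=0
    )  # shape: (num_scales, batch)
    # Convert entropy to fusion weights: lower entropy -> larger weight.
    weights = F.softmax(-entropies, dim=0)  # (num_scales, batch)
    stacked = torch.stack(probs, dim=0)     # (num_scales, batch, num_classes)
    return (weights.unsqueeze(-1) * stacked).sum(dim=0)  # (batch, num_classes)


# Example: fuse CLIP-style predictions computed at three image scales.
fused = uncertainty_calibrated_fusion(
    [torch.randn(4, 10), torch.randn(4, 10), torch.randn(4, 10)]
)
```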