Reconstructing CLIP for Open-Vocabulary Dense Perception
Yajie Liu ⋅ Jinjin Zhang ⋅ Qingjie Liu ⋅ Di Huang
Abstract
Large-scale vision–language models (VLMs) such as CLIP excel at zero-shot image classification, yet they struggle to achieve the dense cross-modal alignment required by open-vocabulary dense perception (OVDP). While recent self-distillation methods address this by aligning dense features with the generalizable global semantics, a key question remains: how should such dense features be constructed to achieve optimal alignment? To address this, we propose DenseRC, a principled $\textbf{Dense}$ $\textbf{R}$epresentations $\textbf{C}$onstruction framework that reconstructs CLIP for OVDP based on two key insights. First, by analyzing the internal semantics encoded in the global $\textit{cls}$ token, we identify multi-layer value embeddings as an informative basis for dense features. Second, we reveal that spatial aggregation tends to amplify semantic misalignment. Motivated by these findings, we design a lightweight Head-Selective Gating (HSG) module that adaptively reweights feature heads according to their intrinsic heterogeneity, enabling the construction of discriminative and alignment-friendly dense representations. Extensive experiments demonstrate that DenseRC delivers consistent and substantial gains across OVDP tasks, including object detection and semantic segmentation, setting new state-of-the-art performance on multiple benchmarks.
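To make the head-reweighting idea concrete, the sketch below shows one plausible form of head-selective gating: each attention head's value embeddings are pooled into a descriptor, scored by a small gating network, and rescaled before the heads are merged into dense features. The class name, gating architecture, and tensor shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class HeadSelectiveGating(nn.Module):
    """Illustrative head-wise gating (assumed design, not the paper's HSG):
    learn a per-head weight from the head's pooled value descriptor and
    rescale that head's dense features before aggregation."""

    def __init__(self, head_dim: int):
        super().__init__()
        # Small gating network mapping each head descriptor to a weight in (0, 1).
        self.gate = nn.Sequential(
            nn.Linear(head_dim, head_dim // 2),
            nn.GELU(),
            nn.Linear(head_dim // 2, 1),
            nn.Sigmoid(),
        )

    def forward(self, values: torch.Tensor) -> torch.Tensor:
        # values: (batch, num_heads, num_tokens, head_dim) value embeddings.
        head_desc = values.mean(dim=2)                # (B, H, D): one descriptor per head
        weights = self.gate(head_desc).unsqueeze(2)   # (B, H, 1, 1): per-head gate
        return values * weights                       # reweighted dense features


# Toy usage: gate the value embeddings of a 12-head ViT layer with 196 patch tokens.
if __name__ == "__main__":
    gating = HeadSelectiveGating(head_dim=64)
    v = torch.randn(2, 12, 196, 64)
    out = gating(v)
    print(out.shape)  # torch.Size([2, 12, 196, 64])
```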