β-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment
Fatimah Zohra ⋅ Chen Zhao ⋅ Hani Itani ⋅ Bernard Ghanem
Abstract
CLIP achieves strong zero-shot image-text retrieval by aligning global vision and text representations, yet it falls behind on fine-grained tasks even when fine-tuned on long, detailed captions. In this work, we propose $\beta$-CLIP, a multi-granular text-conditioned contrastive learning framework that achieves hierarchical alignment across multiple textual granularities -- from full captions to sentences and phrases -- and their corresponding visual regions. For each level of textual granularity, $\beta$-CLIP uses cross-attention to dynamically pool image patches, producing contextualized visual embeddings. A $\beta$-weighted contrastive objective jointly optimizes the multi-granular text–contextualized visual pairs, with both soft cross-entropy and hard binary cross-entropy formulations, enabling controllable intra-image competition and balanced fine-to-coarse alignment. Through extensive experiments on benchmarks spanning diverse granularities, we show that $\beta$-CLIP achieves 30.9\% on FG-OVD (Hard) and, on long-text retrieval, 63.6\% I2T R@1 on DCI and 92.2\% T2I R@1 on Urban1K, reaching the state of the art among methods not trained with hard negatives. $\beta$-CLIP establishes a strong, adaptive baseline for dense vision–language correspondence.
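To make the two mechanisms in the abstract concrete, below is a minimal PyTorch sketch of (i) text-conditioned cross-attention pooling of image patches and (ii) a $\beta$-weighted contrastive loss summed over granularity levels. All names here (`cross_attention_pool`, `beta_weighted_contrastive`, the temperature `tau`, the per-level weights `betas`) are illustrative assumptions, and a standard symmetric cross-entropy is used per level; the paper's soft cross-entropy targets and hard binary cross-entropy variant are not reproduced.

```python
import torch
import torch.nn.functional as F


def cross_attention_pool(text_emb: torch.Tensor, patch_emb: torch.Tensor) -> torch.Tensor:
    """Pool image patches with text-unit queries (single-head, for clarity).

    text_emb:  (T, D) embeddings of T text units (caption / sentence / phrase).
    patch_emb: (P, D) embeddings of P image patches.
    Returns (T, D) contextualized visual embeddings, one per text unit.
    """
    d = text_emb.size(-1)
    attn = torch.softmax(text_emb @ patch_emb.t() / d ** 0.5, dim=-1)  # (T, P)
    return attn @ patch_emb  # (T, D)


def beta_weighted_contrastive(text_levels, visual_levels, betas, tau=0.07):
    """Symmetric contrastive loss summed over granularity levels, each
    weighted by its beta coefficient (hypothetical formulation).

    text_levels / visual_levels: lists of (N_l, D) tensors, one per level
    (e.g. captions, sentences, phrases); matched pairs share an index.
    betas: per-level weights controlling the fine-to-coarse balance.
    """
    total = text_levels[0].new_zeros(())
    for t, v, beta in zip(text_levels, visual_levels, betas):
        t = F.normalize(t, dim=-1)
        v = F.normalize(v, dim=-1)
        logits = t @ v.t() / tau                          # (N_l, N_l) similarities
        labels = torch.arange(t.size(0), device=t.device)  # matches on the diagonal
        loss = 0.5 * (F.cross_entropy(logits, labels)          # text -> visual
                      + F.cross_entropy(logits.t(), labels))   # visual -> text
        total = total + beta * loss
    return total
```

In this reading, the $\beta$ weights let a single objective trade off phrase-level competition within an image against caption-level global alignment.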