CLIP-like Model as a Foundational Density Ratio Estimator
Abstract
Density ratio estimation is a core concept in statistical machine learning because it provides a unified mechanism for tasks such as importance weighting, divergence estimation, and likelihood-free inference, yet its potential in vision-language models has not been fully explored. Modern vision-language encoders such as CLIP and SigLIP are trained with contrastive objectives that implicitly learn similarity scores proportional to the log density ratio between the joint and the marginal image–text distributions. However, prior work has largely focused on their embedding utility, and the density-ratio structure induced by contrastive learning has not been systematically examined or exploited in multimodal applications. To address this gap, we reinterpret CLIP-style models as pretrained, general-purpose density ratio estimators and show that this perspective enables new algorithmic capabilities. We present a unified explanation of how contrastive objectives estimate density ratios and propose two practical applications: Importance Weight Learning and KL divergence estimation. Our Importance Weight Learning method requires only a single additional prompt and improves F1 scores by up to 7 points. We further show that CLIP-based density ratios support estimation of KL divergences that quantify how conditioning on an image or text alters the distribution of the other modality. Through qualitative examples and an N-gram analysis of captions, we find that these divergences capture semantic diversity and mode structure in multimodal data. Leveraging this property, we introduce a simple KL-guided data curation method that achieves performance competitive with LAION2B filtering. Our code will be publicly available.
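To make the density-ratio reading of CLIP concrete, the following is a minimal sketch, not the paper's exact method: it assumes the Hugging Face transformers CLIP implementation, a hypothetical image file dog.jpg, and a hypothetical pool of reference captions. The temperature-scaled image–text similarity is treated as an estimate of the log density ratio log p(x, y) / (p(x) p(y)) up to an additive constant, which self-normalization over the caption pool removes; the same quantities then give plug-in importance weights and a rough KL(p(caption | image) || p(caption)) estimate.

```python
# Sketch: CLIP similarities read as (unnormalized) log density-ratio estimates.
# Assumptions: model name, image path, and caption pool are illustrative only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")          # hypothetical example image
captions = [                           # hypothetical reference caption pool
    "a photo of a dog",
    "a photo of a cat",
    "a city skyline at night",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_image[0, j] is the temperature-scaled similarity s(x, y_j),
# read here as log p(x, y_j) / (p(x) p(y_j)) up to an additive constant.
log_ratio = out.logits_per_image[0]

# Self-normalized importance weights over the caption pool: captions that are
# more likely given the image than under the marginal receive larger weight.
weights = torch.softmax(log_ratio, dim=0)

# Plug-in estimate of KL(p(caption | image) || p(caption)) from the same pool,
# using the identity KL ~= log N - H(weights) for self-normalized weights.
n = torch.tensor(float(len(captions)))
kl_estimate = torch.log(n) + (weights * weights.log()).sum()
print(weights, kl_estimate)
```

In this sketch the additive constant in the similarity cancels because both the weights and the KL estimate depend only on differences of log ratios within the pool; the quality of the estimates therefore hinges on how representative the reference captions are of the marginal text distribution.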