Reevaluating the Intra-modal Misalignment Hypothesis in CLIP
Abstract
Recent research has indicated that the embeddings produced by contrastive language-image training, as in CLIP, may not be ideal for image-only tasks. The prevailing explanation is that the inter-modal (language-image) alignment loss ignores intra-modal (image-image) alignment, leading to poorly calibrated similarities between images. In this study, we question this intra-modal misalignment hypothesis. We reexamine the theoretical arguments and the techniques that seek to demonstrate the misalignment. Our findings reveal that neither the distribution of cosine similarities nor few-shot or retrieval metrics serves as a reliable indicator of misalignment. In fact, these metrics yield similar results for language-image trained models (CLIP, SigLIP) and image-image trained models (DINO, SigLIP2), indicating that there is no intra-modal misalignment stemming from contrastive language-image training. We argue that the observed phenomena can be explained without assuming a fundamental flaw in the image embedding space. Experiments on the commonly studied intra-modal tasks of retrieval and few-shot classification confirm that addressing the supposed misalignment is unnecessary for achieving strong performance.
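To make the comparison concrete, the following is a minimal sketch (not the paper's exact protocol) of how one might compare the distribution of intra-modal (image-image) cosine similarities between a language-image trained encoder (CLIP) and an image-only trained encoder (DINO). The model checkpoints, the image folder path, and the summary statistics reported are illustrative assumptions.

```python
# Sketch: compare intra-modal cosine-similarity distributions of CLIP vs. DINO
# image embeddings. Checkpoints, paths, and reported statistics are assumptions.
import torch
import torch.nn.functional as F
from pathlib import Path
from PIL import Image
from torchvision import transforms
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Language-image trained encoder (CLIP).
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Image-only (self-supervised) encoder, loaded via torch.hub.
dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16").to(device).eval()
dino_tf = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Assumed local folder of images to embed.
images = [Image.open(p).convert("RGB") for p in sorted(Path("images/").glob("*.jpg"))]

@torch.no_grad()
def clip_embeddings(imgs):
    inputs = clip_proc(images=imgs, return_tensors="pt").to(device)
    return F.normalize(clip.get_image_features(**inputs), dim=-1)

@torch.no_grad()
def dino_embeddings(imgs):
    batch = torch.stack([dino_tf(im) for im in imgs]).to(device)
    return F.normalize(dino(batch), dim=-1)

def pairwise_cosine(emb):
    # Off-diagonal entries of the cosine-similarity matrix between all image pairs.
    sim = emb @ emb.T
    mask = ~torch.eye(len(emb), dtype=torch.bool, device=emb.device)
    return sim[mask]

for name, embed in [("CLIP", clip_embeddings), ("DINO", dino_embeddings)]:
    sims = pairwise_cosine(embed(images))
    print(f"{name}: mean={sims.mean():.3f}, std={sims.std():.3f}, "
          f"min={sims.min():.3f}, max={sims.max():.3f}")
```

Under the intra-modal misalignment hypothesis, one would expect such distributions to differ markedly between the two training regimes; the abstract's claim is that summary statistics of this kind do not reliably separate them.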