Poster
Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment
Mayug Maniparambil · Raiymbek Akshulakov · YASSER ABDELAZIZ DAHOU DJILALI · Sanath Narayan · Ankit Singh · Noel O'Connor
Recent contrastive multimodal vision-language models like CLIP have demonstrated robust open-world semantic understanding, becoming the standard image backbones for vision-language applications. However, recent findings suggest high semantic similarity between well-trained unimodal encoders, which raises a key question: are semantically similar embedding spaces separated only by simple projection transformations? To answer this, we propose a novel framework that aligns vision and language using frozen unimodal encoders. It involves selecting semantically similar encoders in the latent space, curating a concept-rich dataset of image-caption pairs, and training simple MLP projectors. We evaluate our approach on tasks that exercise both strong unimodal vision encoders (zero-shot localization) and language encoders (multilingual, long-context), and show that simple projectors retain unimodal capabilities in the joint embedding space. Furthermore, our best model, combining DINOv2 with an All-Roberta-Large text encoder, achieves 76% accuracy on ImageNet with a 20-fold reduction in data and a 65-fold reduction in compute compared to multimodal alignment where models are trained from scratch. The proposed framework makes multimodal model development more accessible while enabling flexible adaptation across diverse scenarios. Code and curated datasets will be released soon.
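To make the core idea concrete, here is a minimal PyTorch sketch of training simple MLP projectors on top of frozen unimodal features with a CLIP-style symmetric contrastive objective. This is an illustrative assumption about the setup, not the authors' released implementation: the embedding dimensions, hidden size, temperature, and helper names (`MLPProjector`, `contrastive_loss`) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLPProjector(nn.Module):
    """Simple MLP that maps frozen unimodal embeddings into a shared space."""

    def __init__(self, in_dim: int, out_dim: int, hidden_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so cosine similarity reduces to a dot product.
        return F.normalize(self.net(x), dim=-1)


def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # Symmetric InfoNCE over a batch of matched image-caption pairs.
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Frozen encoder outputs (e.g. DINOv2 image features and All-Roberta-Large
# text features) are assumed to be precomputed; only the projectors are trained.
img_proj = MLPProjector(in_dim=1024, out_dim=512)
txt_proj = MLPProjector(in_dim=1024, out_dim=512)

frozen_img_feats = torch.randn(32, 1024)  # stand-in for frozen vision features
frozen_txt_feats = torch.randn(32, 1024)  # stand-in for frozen text features

loss = contrastive_loss(img_proj(frozen_img_feats), txt_proj(frozen_txt_feats))
loss.backward()  # gradients flow only through the two small MLP projectors
```

Because the backbones stay frozen and only the lightweight projectors receive gradients, training in this style is far cheaper than aligning both encoders from scratch, which is the source of the data and compute savings the abstract reports.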