

The Neglected Tails in Vision-Language Models

Shubham Parashar · Tian Liu · Zhiqiu Lin · Xiangjue Dong · Yanan Li · James Caverlee · Deva Ramanan · Shu Kong

Arch 4A-E Poster #324
Thu 20 Jun 10:30 a.m. PDT — noon PDT

Abstract: Vision-language models (VLMs) such as CLIP excel in zero-shot recognition but exhibit drastically imbalanced performance across visual concepts in downstream tasks. For example, despite an impressive mean zero-shot accuracy on ImageNet (72.7\%), CLIP yields $<$10\% on ten concepts (e.g., ${\tt night}$ ${\tt snake}$ and ${\tt snoek}$). This is presumably because these concepts are under-represented in VLMs' pretraining datasets, which are believed to exhibit imbalanced distributions of concepts. Yet, assessing this imbalance is challenging, as calculating the frequency of specific concepts within large-scale pretraining data is not straightforward. In this work, we make the first attempt to measure the concept frequency in VLMs' pretraining data by counting relevant pretraining texts. We also use off-the-shelf language models to help count relevant texts that contain synonyms of the given concepts and to resolve ambiguous cases. Our analysis confirms that visual concepts follow a long-tailed distribution in the pretraining data, which strongly correlates with per-class accuracies. Further, to mitigate VLMs' imbalanced performance in zero-shot recognition, we propose ${\bf RE}$trieval-${\bf A}$ugmented ${\bf L}$earning (REAL). First, instead of prompting VLMs using the original class names defined in a downstream task, REAL uses their most frequent synonyms found in the pretraining texts. This already outperforms human-engineered and LLM-generated prompts across nine benchmark datasets, likely because VLMs have seen more images associated with the more frequent synonyms. Second, REAL uses all the concept synonyms to retrieve a small class-balanced subset of images from the pretraining data to learn a robust classifier. REAL rivals the recent retrieval-augmented solution REACT, using $400\times$ less storage and 10,000$\times$ less training time!
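The frequency-counting idea in the abstract can be illustrated with a minimal sketch: given a set of pretraining captions and per-concept synonym lists, count how many captions mention each concept via any of its synonyms. The tiny corpus, the synonym lists, and the function name below are all hypothetical stand-ins for illustration; the paper's actual pipeline uses off-the-shelf language models to propose synonyms and resolve ambiguous matches, which this sketch omits.

```python
import re
from collections import Counter

# Hypothetical mini-corpus standing in for VLM pretraining captions (illustrative only).
captions = [
    "a photo of a night snake on a rock",
    "tiger shark swimming in the open ocean",
    "a tiger shark near the reef",
    "fresh snoek at the fish market",
    "a tiger resting in the grass",
]

# Assumed synonym lists per concept; in the paper these come from language models.
concept_synonyms = {
    "night snake": ["night snake"],
    "tiger shark": ["tiger shark"],
    "snoek": ["snoek", "snook"],
}

def concept_frequency(captions, concept_synonyms):
    """Count how many captions mention any synonym of each concept (whole-word match)."""
    counts = Counter()
    for concept, synonyms in concept_synonyms.items():
        pattern = re.compile(r"\b(" + "|".join(map(re.escape, synonyms)) + r")\b")
        counts[concept] = sum(bool(pattern.search(c.lower())) for c in captions)
    return counts

freqs = concept_frequency(captions, concept_synonyms)
```

Note that whole-word matching keeps "a tiger resting in the grass" from counting toward ${\tt tiger}$ ${\tt shark}$; the harder ambiguity cases (e.g., a synonym that is also a common word) are what the paper delegates to language models.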
