Skip to yearly menu bar Skip to main content


Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models

Luo Jiayun · Siddhesh Khandelwal · Leonid Sigal · Boyang Li

Arch 4A-E Poster #373
[ ]
Wed 19 Jun 10:30 a.m. PDT — noon PDT


From an enormous amount of image-text pairs, large-scale vision-language models (VLMs) learn to implicitly associate image regions with words, which is vital for tasks such as image captioning and visual question answering. However, leveraging such pre-trained models for open-vocabulary semantic segmentation remains a challenge.In this paper, we propose a simple, yet extremely effective, training-free technique, Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS) for this task. PnP-OVSS leverages a VLM with direct text-to-image cross-attention and an image-text matching loss to produce semantic segmentation. However, cross-attention alone tends to over-segment, whereas cross-attention plus GradCAM tend to under-segment. To alleviate this issue, we introduce Salience Dropout; by iteratively dropping patches that the model is most attentive to, we are able to better resolve the entire extent of the segmentation mask. Compared to existing techniques, the proposed method does not require any neural network training and performs hyperparameter tuning without the need for any segmentation annotations, even for a validation set. PnP-OVSS demonstrates substantial improvements over a comparable baseline (+29.4\% on Pascal VOC, +13.2\% on Pascal Context, +14.0\% mIoU on MS COCO, +2.4\% on COCO Stuff) and even outperforms most baselines that conduct additional network training on top of pretrained VLMs.

Live content is unavailable. Log in and register to view live content