

Exploring Regional Clues in CLIP for Zero-Shot Semantic Segmentation

Yi Zhang · Meng-Hao Guo · Miao Wang · Shi-Min Hu

Arch 4A-E Poster #302
[ Project Page ]
Wed 19 Jun 10:30 a.m. PDT — noon PDT


CLIP has demonstrated marked progress in visual recognition thanks to its powerful pre-training on large-scale image-text pairs. However, a critical challenge remains: how to transfer image-level knowledge to pixel-level understanding tasks such as semantic segmentation. In this paper, to address this challenge, we analyze the gap between the capabilities of the CLIP model and the requirements of the zero-shot semantic segmentation task. Based on our analysis and observations, we propose a novel method for zero-shot semantic segmentation, dubbed CLIP-RC (CLIP with Regional Clues), built on two main insights. On the one hand, a region-level bridge is necessary to provide fine-grained semantics. On the other hand, overfitting should be mitigated during the training stage. Benefiting from these discoveries, CLIP-RC achieves state-of-the-art performance on various zero-shot semantic segmentation benchmarks, including PASCAL VOC, PASCAL Context, and COCO-Stuff 164K. Code will be available at
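To make the image-level-to-pixel-level gap concrete, the sketch below illustrates the general idea of dense patch-text matching that underlies CLIP-based zero-shot segmentation: per-patch image features are compared against per-class text embeddings via cosine similarity, and each patch is assigned its best-matching class. This is not the CLIP-RC method itself; the shapes and random features are stand-ins for a real CLIP-like encoder's outputs.

```python
import numpy as np

# Hypothetical stand-ins for CLIP-like outputs (NOT the CLIP-RC method):
# a dense grid of per-patch image features and per-class text embeddings.
rng = np.random.default_rng(0)
H, W, D, C = 14, 14, 512, 3          # patch grid, feature dim, num classes

patch_feats = rng.standard_normal((H, W, D))  # simulated image features
text_embeds = rng.standard_normal((C, D))     # simulated class embeddings

# L2-normalize both sides so dot products become cosine similarities.
patch_feats /= np.linalg.norm(patch_feats, axis=-1, keepdims=True)
text_embeds /= np.linalg.norm(text_embeds, axis=-1, keepdims=True)

# Per-patch class logits, then a hard segmentation map by argmax.
logits = patch_feats @ text_embeds.T          # shape (H, W, C)
seg_map = logits.argmax(axis=-1)              # shape (H, W), class indices
```

Upsampling `seg_map` from the patch grid to the input resolution yields a coarse zero-shot segmentation; the paper's region-level bridge refines exactly this kind of patch-level signal into finer-grained semantics.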
