Tune-An-Ellipse: CLIP Has Potential to Find What You Want

Jinheng Xie · Songhe Deng · Bing Li · Haozhe Liu · Yawen Huang · Yefeng Zheng · Jürgen Schmidhuber · Bernard Ghanem · Linlin Shen · Mike Zheng Shou

Arch 4A-E Poster #394
award Highlight
Thu 20 Jun 10:30 a.m. PDT — noon PDT


Visual prompting of large vision-language models such as CLIP exhibits intriguing zero-shot capabilities. A manually drawn red circle, commonly used for highlighting, can guide CLIP's attention to the enclosed region to identify specific objects within an image. Without precise object proposals, however, this is insufficient for localization. We propose a simple yet effective approach that enables CLIP to localize in a zero-shot manner: given an image and a text prompt describing an object, we first pick an initial ellipse from uniformly distributed anchor ellipses on the image grid via visual prompting, then use three loss functions to tune the ellipse coefficients so that the ellipse gradually encapsulates the target region. This yields promising experimental results for referring expression comprehension without precisely specified object proposals. In addition, we systematically present the limitations of visual prompting inherent in CLIP and discuss potential avenues for improvement. Code will be released.
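The two-stage procedure described in the abstract (anchor selection via visual prompting, then iterative tuning of the ellipse coefficients) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `clip_score` is a hypothetical stand-in for CLIP's image-text similarity when an ellipse is drawn on the image (here faked with a fixed target so the loop runs end to end), a single finite-difference objective replaces the paper's three loss functions, and the ellipse is parameterized by normalized center and axes `(cx, cy, a, b)`.

```python
# Sketch of the Tune-An-Ellipse idea: pick the best anchor ellipse, then
# tune its coefficients by gradient ascent on a prompting score.
# All names and the scoring function below are illustrative assumptions.

TARGET = (0.6, 0.4, 0.15, 0.10)  # stand-in for the (unknown) target region

def clip_score(ellipse):
    # Hypothetical proxy for CLIP similarity between the text prompt and the
    # image with this ellipse drawn on it: higher when closer to the target.
    return -sum((p - t) ** 2 for p, t in zip(ellipse, TARGET))

def anchor_ellipses(grid=4, a=0.1, b=0.1):
    # Uniformly distributed anchor ellipses over the image grid.
    step = 1.0 / grid
    return [((i + 0.5) * step, (j + 0.5) * step, a, b)
            for i in range(grid) for j in range(grid)]

def tune(steps=200, lr=0.05, eps=1e-3):
    # Stage 1: visual prompting picks the best-scoring anchor as the init.
    ellipse = list(max(anchor_ellipses(), key=clip_score))
    # Stage 2: tune the four coefficients by finite-difference gradient ascent
    # (the paper optimizes differentiable losses; this is a toy surrogate).
    for _ in range(steps):
        for k in range(4):
            plus, minus = ellipse[:], ellipse[:]
            plus[k] += eps
            minus[k] -= eps
            g = (clip_score(plus) - clip_score(minus)) / (2 * eps)
            ellipse[k] += lr * g
    return tuple(ellipse)

fitted = tune()
```

Under this toy score the tuned ellipse converges to the target coefficients; in the actual method the gradient signal would come from CLIP's response to the rendered ellipse prompt rather than a known target.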