

Poster

Visual In-Context Prompting

Feng Li · Qing Jiang · Hao Zhang · Shilong Liu · Huaizhe Xu · Xueyan Zou · Tianhe Ren · Hongyang Li · Lei Zhang · Chunyuan Li · Jianwei Yang · Jianfeng Gao


Abstract:

In-context prompting in large language models (LLMs) has become a prevalent approach to improving zero-shot capabilities, but the idea remains less explored in the vision domain. Existing visual prompting methods focus on referring segmentation, i.e., segmenting the most relevant object, and fall short of addressing many general vision tasks such as open-set segmentation and detection. In this paper, we introduce a unified visual in-context prompting framework for both tasks, as shown in Fig. 1. In particular, we build on top of an encoder-decoder architecture and develop a versatile content embedder that supports a variety of prompts such as strokes, boxes, and points. We further enhance it to take an arbitrary number of reference images as context. Our experiments show that the proposed in-context prompting demonstrates impressive referring and generic segmentation capabilities, yielding competitive performance on close-set in-domain datasets and promising results on many open-set segmentation datasets.
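To make the idea concrete, below is a minimal, hypothetical sketch of how visual in-context prompting could be wired up: a content embedder pools image features under prompt masks (rasterized strokes, boxes, or points) from one or more reference images, and the resulting queries attend to the target image to produce masks. All names here (VisualPromptEmbedder, InContextSegmenter, the pooling and attention choices) are illustrative assumptions, not the authors' actual architecture or API.

import torch
import torch.nn as nn


class VisualPromptEmbedder(nn.Module):
    """Hypothetical content embedder: pools reference-image features under each prompt mask."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, image_feats: torch.Tensor, prompt_masks: torch.Tensor) -> torch.Tensor:
        # image_feats:  (B, C, H, W) backbone features of a reference image
        # prompt_masks: (B, N, H, W) binary masks rasterized from strokes/boxes/points
        feats = image_feats.flatten(2)                        # (B, C, H*W)
        masks = prompt_masks.flatten(2).float()               # (B, N, H*W)
        masks = masks / masks.sum(-1, keepdim=True).clamp(min=1.0)
        pooled = torch.einsum("bnl,bcl->bnc", masks, feats)   # one query per prompt
        return self.proj(pooled)                              # (B, N, C)


class InContextSegmenter(nn.Module):
    """Hypothetical encoder-decoder wrapper: in-context queries from reference images
    attend to target-image features and are decoded into per-prompt mask logits."""

    def __init__(self, feat_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.embedder = VisualPromptEmbedder(feat_dim)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, target_feats, ref_feats_list, ref_masks_list):
        # Encode an arbitrary number of reference images into a single set of queries.
        queries = torch.cat(
            [self.embedder(f, m) for f, m in zip(ref_feats_list, ref_masks_list)], dim=1
        )                                                     # (B, sum_N, C)
        B, C, H, W = target_feats.shape
        kv = target_feats.flatten(2).transpose(1, 2)          # (B, H*W, C)
        out, _ = self.cross_attn(queries, kv, kv)             # refined queries
        # Dot-product each query with target features to get per-prompt mask logits.
        mask_logits = torch.einsum("bnc,bcl->bnl", out, target_feats.flatten(2))
        return mask_logits.view(B, -1, H, W)


if __name__ == "__main__":
    model = InContextSegmenter()
    target = torch.randn(1, 256, 32, 32)
    refs = [torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32)]   # two reference images
    masks = [torch.randint(0, 2, (1, 3, 32, 32)), torch.randint(0, 2, (1, 2, 32, 32))]
    print(model(target, refs, masks).shape)  # torch.Size([1, 5, 32, 32])

The sketch only illustrates the interface implied by the abstract: heterogeneous prompts are reduced to a shared query space, so adding more reference images simply concatenates more queries without changing the decoder.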
