Skip to yearly menu bar Skip to main content


A Pedestrian is Worth One Prompt: Towards Language Guidance Person Re-Identification

Zexian Yang · Dayan Wu · Chenming Wu · Zheng Lin · JingziGU · Weiping Wang

Arch 4A-E Poster #266
award Highlight
[ ]
Thu 20 Jun 5 p.m. PDT — 6:30 p.m. PDT


Extensive advancements have been made in person ReID through the mining of semantic information. Nevertheless, existing methods that utilize semantic-parts from a single image modality do not explicitly achieve this goal. Whiteness the impressive capabilities in multimodal understanding of Vision Language Foundation Model CLIP, a recent two-stage CLIP-based method employs automated prompt engineering to obtain specific textual labels for classifying pedestrians. However, we note that the predefined soft prompts may be inadequate in expressing the entire visual context and struggle to generalize to unseen classes. This paper presents an end-to-end Prompt-driven Semantic Guidance (PromptSG) framework that harnesses the rich semantics inherent in CLIP. Specifically, we guide the model to attend to regions that are semantically faithful to the prompt. To provide the personalized language descriptions for specific individuals, we propose learning pseudo tokens that represent specific visual context. This design not only facilitates learning fine-grained attribute information but also can inherently leverage language prompts during inference. Without requiring additional labeling efforts, our PromptSG achieves state-of-the-art by over 10\% on MSMT17 and nearly 5\% on the Market-1501 benchmark.

Live content is unavailable. Log in and register to view live content