Poster
Chat-based Person Retrieval via Dialogue-Refined Cross-Modal Alignment
Yang Bai · Yucheng Ji · Min Cao · Jinqiao Wang · Mang Ye
Abstract:
Traditional text-based person retrieval (TPR) relies on a single-shot text query to retrieve the target person, assuming that the query completely captures the user's search intent. In real-world scenarios, however, it is difficult to guarantee that such a single-shot text is informationally complete. To address this limitation, we propose chat-based person retrieval (ChatPR), a new paradigm that takes an interactive dialogue as the query, engaging the user in conversational context to progressively refine the search intent for accurate person retrieval. The primary challenge in ChatPR is the lack of available dialogue-image paired data. To overcome this challenge, we establish ChatPedes, the first dataset designed for ChatPR, constructed by leveraging large language models to automate question generation and simulate user responses. Additionally, to bridge the modality gap between dialogues and images, we propose a dialogue-refined cross-modal alignment (DiaNA) framework, which leverages two adaptive attribute refiners to bottleneck conversational and visual information for fine-grained cross-modal alignment. Moreover, we propose a dialogue-specific data augmentation strategy, random round retaining, to further enhance the model's generalization across varying dialogue lengths. Extensive experiments demonstrate that DiaNA significantly outperforms existing TPR approaches, highlighting the effectiveness of conversational interaction for person retrieval. The dataset and code will be made publicly available.
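The abstract does not detail how the adaptive attribute refiners bottleneck the two modalities. Below is a minimal sketch of one plausible reading: a small set of learnable attribute queries cross-attends over variable-length token features, compressing dialogue and image representations into fixed-size attribute sets that can then be aligned fine-grainedly. The class name `AttributeRefiner` and parameters such as `num_attrs` are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class AttributeRefiner(nn.Module):
    """Hypothetical bottleneck module: a fixed number of learnable
    attribute queries cross-attend over input tokens (dialogue or
    image features), summarizing them into a compact attribute set."""

    def __init__(self, dim=512, num_attrs=8, num_heads=8):
        super().__init__()
        self.attr_queries = nn.Parameter(torch.randn(num_attrs, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):  # tokens: (batch, seq_len, dim)
        # Expand the shared attribute queries to the batch size.
        q = self.attr_queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        # Queries attend over the tokens; output is (batch, num_attrs, dim).
        attrs, _ = self.cross_attn(q, tokens, tokens)
        return self.norm(attrs)

# One refiner per modality; both outputs live in the same compact
# attribute space, enabling fine-grained cross-modal alignment.
dialogue_refiner, image_refiner = AttributeRefiner(), AttributeRefiner()
d_attrs = dialogue_refiner(torch.randn(2, 77, 512))   # dialogue tokens
v_attrs = image_refiner(torch.randn(2, 197, 512))     # image patch tokens
```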
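Likewise, the random round retaining augmentation is only named, not specified. A natural interpretation, sketched below under that assumption, is to truncate each training dialogue to a random prefix of its question-answer rounds so the model sees queries of varying completeness; the function name and round format are illustrative.

```python
import random

def random_round_retain(dialogue_rounds, min_rounds=1):
    """Randomly keep only the first k rounds of a dialogue (a sketch of
    the dialogue-length augmentation, not the paper's exact procedure)."""
    k = random.randint(min_rounds, len(dialogue_rounds))
    return dialogue_rounds[:k]

# Example: a 4-round dialogue may be truncated to its first 1-4 rounds,
# exposing the retrieval model to partially refined queries.
dialogue = [
    ("What is the person wearing on top?", "A red jacket."),
    ("What about the lower body?", "Blue jeans."),
    ("Any bag or accessory?", "A black backpack."),
    ("What is the hairstyle?", "Short dark hair."),
]
print(random_round_retain(dialogue))
```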