Tackling Alignment Ambiguity in Person Retrieval through Conversational Attribute Mining
Abstract
Text-to-Image Person Retrieval (TIPR) aims to retrieve pedestrian images matching a given natural-language description. The task remains highly challenging due to the inherent ambiguity of cross-modal alignment: existing models often struggle to capture fine-grained correspondences, and their understanding of detailed pedestrian attributes is typically limited to partial or coarse cues, leading to mismatched or erroneous retrieval results. To overcome this challenge, we propose CECA, a Conversation-Enhanced Cross-modal Alignment framework. CECA strengthens attribute correspondence between the textual and visual modalities through dialogue guided by multimodal large language models (MLLMs), enhances fine-grained cross-modal matching via a Bidirectional Correlation Matching (BCM) mechanism, and stabilizes optimization with a Confidence-Aware Weighting Loss (CAWL) that down-weights low-quality conversational responses. Extensive experiments on three public benchmarks demonstrate the superior performance and strong generalization ability of our approach.