Quota-Calibrated Fine-Grained Alignment with Context-Aware Marginals for Text-based Person Retrieval
Abstract
The core challenge in Text-based Person Retrieval (TPR) lies in establishing fine-grained, many-to-many semantic alignment between textual words and visual regions. Existing methods predominantly rely on pointwise similarity or attention mechanisms, implicitly assuming that matches are independent and balanced. Consequently, under attribute overlap and heavy background noise, these methods often misallocate matching weight to non-discriminative regions or words, yielding ambiguous matches. To address this, we propose QC-Align, a quota-calibrated fine-grained alignment framework guided by context-aware marginals. Specifically, a Context-Aware Marginal Estimator (CAME) dynamically assigns a "matching quota" to each word and visual region, and a Quota-Calibrated Transport (QCT) objective then explicitly bounds the matching mass each word and region can carry, jointly optimizing the many-to-many correspondence between text and vision under these constraints. Notably, QC-Align is a parameter-free, plug-and-play training regularizer that requires no fine-grained annotations and incurs no inference overhead. Experiments on multiple mainstream person retrieval benchmarks demonstrate that QC-Align consistently improves baseline performance, with larger gains and better interpretability in few-shot and cross-domain scenarios.
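To make the quota-calibrated transport idea concrete, the following is a minimal sketch, not the paper's actual implementation: it assumes QCT can be instantiated as entropic optimal transport solved with Sinkhorn iterations, where the row/column marginals play the role of CAME's per-word and per-region "matching quotas". The `quotas_from_saliency` helper and the toy similarity matrix are hypothetical stand-ins for whatever the estimator actually produces.

```python
import numpy as np

def sinkhorn(cost, row_quota, col_quota, eps=0.1, n_iters=200):
    """Entropic transport sketch: find a nonnegative plan P whose row sums
    equal row_quota (word quotas) and column sums equal col_quota (region
    quotas), approximately minimizing <P, cost> - eps * H(P)."""
    K = np.exp(-cost / eps)          # Gibbs kernel of the cost matrix
    u = np.ones(len(row_quota))
    for _ in range(n_iters):         # alternating marginal rescalings
        v = col_quota / (K.T @ u)    # match column (region) quotas
        u = row_quota / (K @ v)      # match row (word) quotas
    return u[:, None] * K * v[None, :]

def quotas_from_saliency(scores, tau=1.0):
    """Hypothetical quota assignment: softmax over saliency scores, so
    more discriminative words/regions receive a larger matching quota."""
    e = np.exp((scores - scores.max()) / tau)
    return e / e.sum()

rng = np.random.default_rng(0)
sim = rng.random((4, 5))             # toy word-region similarity matrix
cost = 1.0 - sim                     # transport cost = dissimilarity
r = quotas_from_saliency(rng.random(4))   # quotas for 4 words
c = quotas_from_saliency(rng.random(5))   # quotas for 5 regions
P = sinkhorn(cost, r, c)             # calibrated many-to-many plan
```

Under this reading, the quotas act as capacity constraints: a background region with a small quota simply cannot absorb much matching mass, regardless of its raw similarity to any word, which is how the calibration suppresses non-discriminative matches.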