Rethinking Cross-Modal Anchor Alignment for Mitigating Error Accumulation
Abstract
Mitigating noisy correspondence in cross-modal matching poses a serious challenge due to the problem of error accumulation. Existing methods primarily attribute this accumulation to errors caused by noisy sample pairs. However, a novel source of error arising from clean sample pairs (also termed anchor pairs) is identified in this paper. This error accumulation is attributed to modality-inconsistent correlations. To address this issue, a novel method termed Geometric-Semantic Learning (GSL) is proposed. First, GSL leverages the Fourier transform to emphasize semantic representations and reduce cross-modal inconsistencies caused by perturbations in non-critical fine-grained features, thereby alleviating error accumulation. Second, a Geometry-Aware Label Correction (GALC) method is introduced to re-estimate soft correspondence labels by exploiting the angular consistency between noisy sample pairs and anchor pairs across modalities. Finally, a semantically constrained triplet loss is employed to regulate sample distances using semantic information, enabling robust separation of clean and noisy pairs during training. Extensive experiments on three benchmark datasets demonstrate that GSL consistently outperforms existing methods in retrieval accuracy.
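To make the two core ideas concrete, the following is a minimal illustrative sketch, not the paper's implementation: the function names, the `keep_ratio` parameter, and the exact consistency score are all assumptions introduced here for exposition. It shows (a) low-pass filtering a feature vector in the frequency domain so that coarse, presumably semantic, components are kept while fine-grained perturbations are suppressed, and (b) scoring a candidate pair by how consistently it sits, in angle, relative to a clean anchor pair across the two modalities.

```python
import numpy as np

def fourier_semantic_filter(feature, keep_ratio=0.25):
    """Keep only the lowest-frequency coefficients of a 1-D feature vector
    (taken here as a proxy for coarse semantic content) and zero the rest,
    which stand in for non-critical fine-grained perturbations.
    `keep_ratio` is an illustrative hyperparameter, not from the paper."""
    spectrum = np.fft.rfft(feature)
    cutoff = max(1, int(len(spectrum) * keep_ratio))
    spectrum[cutoff:] = 0.0
    return np.fft.irfft(spectrum, n=len(feature))

def angular_consistency(img_emb, txt_emb, anchor_img, anchor_txt):
    """Compare the angle of the image-side embedding to the image anchor
    against the angle of the text-side embedding to the text anchor.
    A score near 1 means the pair's geometry agrees across modalities
    (suggesting a clean pair); lower scores could drive a softer
    correspondence label. The scoring rule here is a guessed stand-in."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return 1.0 - abs(cos(img_emb, anchor_img) - cos(txt_emb, anchor_txt))
```

In this sketch, a pair whose image and text embeddings form the same angle with the anchor pair in their respective modalities receives a high consistency score, mirroring the abstract's notion of angular consistency between noisy pairs and anchor pairs.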