TriSim: Tri-Dimensional Similarity Modeling with Extreme Value Theory for False-Negative Mitigation in Remote Sensing Image-Text Retrieval
Chengyu Zheng ⋅ Hanzhang Lu ⋅ Jie Nie ⋅ Shan Du
Abstract
In remote sensing (RS) cross-modal retrieval, most existing methods employ contrastive learning as their primary optimization objective, aligning anchors with positive counterparts and distinguishing them from negative samples. To improve negative sampling, these approaches typically set thresholds on cross-modal similarity scores, designating negatives that exceed the threshold as false negative samples (FNS). However, dependence on a single cross-modal similarity threshold is fragile because it fails to account for cross-modal semantic overlaps and gaps. To address these challenges, we introduce TriSim, a novel image-text retrieval framework that constructs a tri-dimensional negative similarity space $<$img-img, img-txt, txt-txt$>$ to mitigate the FNS issue. Specifically, since FNS appear as anomalies in this space, Extreme Value Theory (EVT) is applied to model the statistical behavior of the tail distribution for FNS selection. Two complementary tail selection strategies are developed: one identifies samples distant from the dense ellipsoidal center, and the other targets upper-right high-similarity extremes. The selected tail samples are regarded as FNS and modeled with a generalized Pareto distribution, whose probabilities serve as weights in the triplet loss. To further refine the selected FNS, intra-modal saliency differences are computed to generate masks that guide the learning of a gain matrix, which amplifies highly discriminative regions and suppresses ambiguous ones. Extensive experiments on two benchmarks demonstrate the superiority of the proposed TriSim framework in mitigating the influence of false negatives in RS image-text retrieval.
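The abstract's EVT step can be illustrated with a minimal sketch: select tail samples in the tri-dimensional similarity space by their distance from the dense center, fit a generalized Pareto distribution (GPD) to the exceedances, and convert GPD probabilities into per-sample weights. This is not the authors' implementation; the synthetic similarity data, the 95% threshold, and the use of SciPy's `genpareto` are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(0)

# Hypothetical data: each row holds the <img-img, img-txt, txt-txt>
# similarities of one negative sample (synthetic stand-in values).
sims = rng.beta(2.0, 5.0, size=(1000, 3))

# Tail selection (first strategy in the abstract, sketched):
# distance of each sample from the dense center of the cloud.
center = sims.mean(axis=0)
dist = np.linalg.norm(sims - center, axis=1)

# Peaks-over-threshold: samples beyond a high quantile form the tail
# and are treated as candidate false negatives (threshold is assumed).
u = np.quantile(dist, 0.95)
exceedances = dist[dist > u] - u

# Fit a generalized Pareto distribution to the exceedances (EVT step).
shape, loc, scale = genpareto.fit(exceedances, floc=0.0)

# Probabilistic weights from the fitted GPD: more extreme samples get
# values closer to 1; these would modulate the triplet loss terms.
weights = genpareto.cdf(exceedances, shape, loc=loc, scale=scale)
```

With roughly the top 5% of 1000 samples in the tail, `weights` holds one value in [0, 1] per suspected false negative, ready to reweight the corresponding triplet-loss contributions.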