Mask to Align, Weight to Disambiguate: Reliable Unsupervised Cross-Modal Hashing with Masked-Weight Contrast
Abstract
In unsupervised cross-modal hashing, real-world data often exhibit partial alignment and semantic mismatch: dominant modalities tend to overrule fusion, fine-grained complementary cues are overlooked, and mini-batch “negatives” are contaminated by semantically related items, yielding frequent false negatives. Treating all pairs equally in contrastive learning therefore makes training noise-prone and ill-suited to partially aligned data. To address these issues, we present Unsupervised Weighted Masked Contrastive Hashing (UWMCH), built on two components: (i) random masked fusion deliberately suppresses part of the modality evidence during feature interaction, forcing the model to learn complementary semantics under diverse partial interactions, preventing reliance on a single modality, and explicitly exposing hard cases; (ii) pairwise weighting, rather than treating masked and unmasked pairs as equivalent, adaptively assigns a weight to each cross-modal pair by combining instance-level semantic consistency with a K-means-induced cluster-consensus prior, and injects this weight into the contrastive objective to suppress suspected false negatives and amplify the more informative masked positives. To stabilize the global structure, we further introduce two constraints: Cluster-Centroid Agreement (CCA), which forms global semantic anchors at the prototype level in synergy with the masked-weight contrast, and Semantic Structure Regularization (SSR), which builds higher-order semantic structure and aligns it with cross-modal similarity, maintaining intra-modal compactness and inter-modal separability under masking. Extensive experiments show that UWMCH achieves better retrieval accuracy and convergence stability across multiple benchmark datasets. The code will be released.
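To make the core idea concrete, the following is a minimal NumPy sketch, not the paper's implementation: a random feature mask simulating partial interaction, and a contrastive loss whose softmax denominator carries per-pair weights so that suspected false negatives (weight below 1) contribute less. The function names, the temperature `tau`, and the way weights enter the denominator are illustrative assumptions; the paper's actual weighting combines instance-level consistency with a K-means cluster-consensus prior, which is abstracted here into a given weight matrix `w`.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(x, mask_ratio=0.3, rng=rng):
    # Zero out a random subset of feature entries, simulating a
    # "partial interaction" in which some modality evidence is hidden.
    keep = rng.random(x.shape) > mask_ratio
    return x * keep

def weighted_contrastive_loss(z_img, z_txt, w, tau=0.5):
    # z_img, z_txt: (B, d) continuous codes for the two modalities.
    # w: (B, B) pair weights in [0, 1]; w[i, j] < 1 down-weights pair
    # (i, j) in the denominator, suppressing suspected false negatives.
    zi = z_img / np.linalg.norm(z_img, axis=1, keepdims=True)
    zt = z_txt / np.linalg.norm(z_txt, axis=1, keepdims=True)
    sim = zi @ zt.T / tau
    sim -= sim.max(axis=1, keepdims=True)      # numerical stability
    num = np.exp(np.diag(sim))                 # matched cross-modal pairs
    den = (w * np.exp(sim)).sum(axis=1)        # weighted negatives
    return float(np.mean(-np.log(num / den)))
```

In this sketch, setting `w[i, j]` close to 0 for a semantically related off-diagonal pair removes its gradient pressure as a negative, which is the intended effect of the weighting in UWMCH.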