Easy2Hard: From Partially to Fully Unmatched Modalities as Negative Samples in Contrastive Learning
Abstract
Contrastive learning is widely used for generating multimodal data representations by aligning embeddings of different modalities of the same data samples. This alignment is achieved through a loss function that treats matched and unmatched modality pairs as positive and negative samples within a data batch. However, when extending contrastive learning to scenarios involving more than two modalities, existing approaches either rely solely on fully unmatched modalities as negative samples, or fail to distinguish between partially and fully unmatched modalities, thereby overlooking the fine-grained contrastive relationships. To address this limitation, we propose Easy2Hard, a novel framework that explicitly separates partially and fully unmatched modalities. To learn from negative samples at this finer granularity, Easy2Hard further introduces a sigmoid weighting curriculum that smoothly transitions the learning process from easy (partially unmatched) to hard (fully unmatched) contrasts. Comprehensive evaluations on five multimodal datasets across diverse domains demonstrate the superiority of our approach.
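The abstract does not give the exact form of the sigmoid weighting curriculum; as one plausible instantiation, the sketch below blends the loss weights on easy (partially unmatched) and hard (fully unmatched) negatives with a logistic schedule over training progress. The function name, the midpoint at half of training, and the `steepness` parameter are all illustrative assumptions, not the paper's definition.

```python
import math

def curriculum_weights(step: int, total_steps: int, steepness: float = 10.0):
    """Sigmoid curriculum: the weight on hard (fully unmatched) negatives
    grows smoothly from ~0 to ~1 as training progresses; easy (partially
    unmatched) negatives receive the complementary weight.

    The midpoint (0.5) and steepness are hypothetical hyperparameters.
    """
    # Normalized training progress in [0, 1].
    progress = step / total_steps
    # Logistic transition centered at the midpoint of training.
    hard = 1.0 / (1.0 + math.exp(-steepness * (progress - 0.5)))
    easy = 1.0 - hard
    return easy, hard

# Early in training, easy negatives dominate; late, hard negatives do.
easy_w, hard_w = curriculum_weights(step=100, total_steps=10_000)
```

Under this assumed schedule, the total contrastive loss would be formed as `easy_w * loss_partial + hard_w * loss_full`, so the model first learns coarse cross-modal distinctions before emphasizing the harder fully-unmatched contrasts.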