

SaCo Loss: Sample-wise Affinity Consistency for Vision-Language Pre-training

WU Sitong · Haoru Tan · Zhuotao Tian · Yukang Chen · Xiaojuan Qi · Jiaya Jia

Arch 4A-E Poster #308
Fri 21 Jun 5 p.m. PDT — 6:30 p.m. PDT


Vision-language pre-training (VLP) aims to learn joint representations of vision and language modalities. The contrastive paradigm is currently dominant in this field. However, we observe a notable misalignment phenomenon: the affinity between samples differs markedly across modalities, which we call the "Affinity Inconsistency Problem". Our intuition is that, for a well-aligned model, two images that look similar to each other should be as similar as the texts that describe them. In this paper, we first investigate the cause of this inconsistency. We find that the central cause is that existing training objectives do not consider sample-wise affinity consistency across modalities. To address this problem, we propose a novel loss function, named Sample-wise affinity Consistency (SaCo) loss, which is designed to enhance such consistency by minimizing the distance between image embedding similarity and text embedding similarity for any two samples. Because it is complementary to most training objectives, our SaCo loss can be easily incorporated into existing vision-language models as an additional loss. In addition, since pre-training from scratch is computationally expensive, we also provide a more efficient alternative: continuing to pre-train a converged model with our loss integrated. Experimentally, the model trained with our SaCo loss significantly outperforms the baseline on a variety of vision and language tasks.
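The idea of penalizing the gap between image-image and text-text similarity can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity measure or the distance, so the sketch assumes cosine similarity and a mean squared difference, and the function names `l2_normalize` and `affinity_consistency_loss` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def affinity_consistency_loss(image_emb, text_emb):
    """Mean squared gap between image-image and text-text cosine similarity.

    Assumption: cosine similarity and squared difference are illustrative
    choices; the abstract only says the distance between the two
    similarities is minimized for any two samples.
    """
    img = l2_normalize(np.asarray(image_emb, dtype=np.float64))
    txt = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    img_affinity = img @ img.T  # (N, N) similarity between every pair of images
    txt_affinity = txt @ txt.T  # (N, N) similarity between every pair of texts
    return float(np.mean((img_affinity - txt_affinity) ** 2))


# Toy batch: 4 paired samples with 8-dimensional embeddings (random stand-ins).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 8))
text_emb = rng.normal(size=(4, 8))

loss_random = affinity_consistency_loss(image_emb, text_emb)
# Identical affinity structure in both modalities gives zero loss.
loss_identical = affinity_consistency_loss(image_emb, image_emb)
```

In a training setup, a term like this would be added to the existing objective (for example a contrastive loss), presumably with a weighting coefficient; the abstract does not state how the terms are balanced.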
