

Poster

Rethinking Noisy Video-Text Retrieval via Relation-aware Alignment

Huakai Lai · Guoxin Xiong · Huayu Mai · Xiang Liu · Tianzhu Zhang


Abstract:

Video-Text Retrieval (VTR) is a core task in multi-modal understanding that has drawn growing attention from both academia and industry in recent years. While numerous VTR methods have achieved success, most of them assume accurate visual-text correspondences during training, which is difficult to ensure in practice due to ubiquitous noise, known as noisy correspondences (NC). In this work, we rethink how to mitigate NC from the perspective of representative reference features (termed agents), and propose a novel relation-aware purified consistency (RPC) network to amend direct pairwise correlation, comprising representative agent construction and relation-aware ranking distribution alignment. The proposed RPC enjoys several merits. First, to learn the agents well without any correspondence supervision, we customize agent construction according to three characteristics: reliability, representativeness, and resilience. Second, the ranking distribution-based alignment process leverages the structural information inherent in inter-pair relationships, making it more robust than individual pairwise comparisons. Extensive experiments on five datasets under different settings demonstrate the efficacy and robustness of our method.
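
The abstract does not specify how the ranking distributions are computed or which loss aligns them. As a rough illustration only, the sketch below assumes that each video or text feature is compared against a shared set of agents by cosine similarity, that a temperature-scaled softmax turns those similarities into a distribution, and that a symmetric KL divergence aligns the two distributions. All function names, the temperature value, and the toy vectors are hypothetical and are not taken from the RPC paper.

```python
import math

def softmax(scores, temperature=1.0):
    # Numerically stable softmax over a list of scores.
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    # Cosine similarity; assumes both vectors are non-zero.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relation_distribution(feature, agents, temperature=0.1):
    # Distribution of one feature's similarity over all agents
    # (an assumed stand-in for a "ranking distribution").
    return softmax([cosine(feature, a) for a in agents], temperature)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) with a small epsilon to avoid log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def alignment_loss(video_feat, text_feat, agents, temperature=0.1):
    # Symmetric KL between the video-side and text-side distributions.
    # The choice of symmetric KL is an assumption made for this sketch.
    p = relation_distribution(video_feat, agents, temperature)
    q = relation_distribution(text_feat, agents, temperature)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

# Toy data: three hypothetical 2-D agents and one video feature.
agents = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
video = [0.9, 0.1]
matched_text = [0.8, 0.2]      # relates to the agents much like the video
mismatched_text = [-0.7, 0.3]  # relates to the agents very differently

matched_loss = alignment_loss(video, matched_text, agents)
mismatched_loss = alignment_loss(video, mismatched_text, agents)
```

In this toy setting the matched pair yields a far smaller loss than the mismatched one, because the comparison rests on how each sample relates to every agent rather than on a single video-text similarity score.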
