Gravitation-Driven Semantic Alignment for Text-Video Retrieval
Abstract
The inherent "many-to-many" semantic ambiguity, where one video matches multiple texts and vice versa, aggravates the difficulty of text-video retrieval. Dominant deterministic embeddings can capture only the mean semantics, while existing probabilistic methods fail to distinguish hard negatives because they impose rigid uncertainty priors or ignore the interaction between similarity and uncertainty. To address this, we propose a novel physics-inspired framework, GraviAlign, which decomposes the alignment of cross-modal semantic distributions into two orthogonal factors by analogy with gravitational force: (1) Semantic Attraction, which measures gravitational alignment between distribution centers via uncertainty-derived "semantic mass" and "semantic distance"; and (2) Geometric Overlap, which quantifies the intersection of the two distributions. Each factor holds independent veto power, rejecting matches that are either misaligned or poorly overlapping. Additionally, GraviAlign offers an efficient O(D), theoretically grounded alternative to intractable joint integrals over the distributions. Extensive experiments on DiDeMo, MSR-VTT, and ActivityNet demonstrate the effectiveness and superiority of our approach, and thorough ablation studies confirm that both novel components are indispensable.
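To make the decomposition concrete, the following minimal Python sketch illustrates how a GraviAlign-style score could be computed for one text-video pair of Gaussian embeddings with diagonal covariance. The abstract does not specify the exact formulas, so every concrete choice here is an assumption for illustration: semantic mass as inverse average variance, semantic attraction as mass_t * mass_v / distance^2, geometric overlap via the closed-form Bhattacharyya coefficient (which is O(D) for diagonal Gaussians), and multiplicative fusion so that either factor can veto a match.

```python
import numpy as np

def grav_align_score(mu_t, var_t, mu_v, var_v, eps=1e-8):
    """Hypothetical GraviAlign-style score for one text/video pair.

    mu_*  : (D,) distribution means (embedding centers)
    var_* : (D,) diagonal variances (uncertainty estimates)
    All design choices below are illustrative assumptions, not the paper's formulas.
    """
    # "Semantic mass": one plausible choice is inverse average uncertainty,
    # so more confident embeddings are "heavier".
    mass_t = 1.0 / (var_t.mean() + eps)
    mass_v = 1.0 / (var_v.mean() + eps)

    # "Semantic distance" between the distribution centers.
    dist_sq = np.sum((mu_t - mu_v) ** 2) + eps

    # (1) Semantic Attraction: gravitational analogy m_t * m_v / d^2.
    attraction = mass_t * mass_v / dist_sq

    # (2) Geometric Overlap: closed-form Bhattacharyya coefficient for
    # diagonal Gaussians -- O(D), avoiding any intractable joint integral.
    var_avg = 0.5 * (var_t + var_v)
    db = 0.125 * np.sum((mu_t - mu_v) ** 2 / var_avg) \
         + 0.5 * np.sum(np.log(var_avg) - 0.5 * (np.log(var_t) + np.log(var_v)))
    overlap = np.exp(-db)

    # Multiplicative fusion: if either factor is near zero, it vetoes the match.
    return attraction * overlap
```

Under this reading, a pair with well-aligned centers but nearly disjoint distributions (or vice versa) receives a low score, which is one way the independent veto behavior described above could be realized.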