CoV-Align: Efficient Fine-grained Cross-Modal Alignment with Cohesive Visual Semantics Priority
Hengqi Liu ⋅ Wanting Zhou ⋅ Longteng Kong ⋅ Fangxiang Feng ⋅ Lei Ren ⋅ Wei Chen ⋅ Xiaojie Wang
Abstract
Cross-modal alignment aims to learn semantically consistent latent representations across diverse modalities. Prevailing methods rely on a text-guided aggregation paradigm to achieve fine-grained alignment, but they suffer from redundant patch-word correlations and high computational costs. To address these issues, we propose CoV-Align, an effective and efficient fine-grained cross-modal alignment framework with cohesive visual semantics priority. Through a semantically convergent attention mechanism, it progressively aggregates meaningful visual patches in a text-free manner. We design a coarse visual semantic feature extractor that integrates deformable attention and consistent assignment attention to group patches with semantic consistency. We further present a cohesive and discriminative feature optimization that enhances the intra-semantic cohesion and inter-semantic discriminability of visual region features, yielding explicit improvements in cross-modal alignment. Extensive experiments demonstrate that CoV-Align achieves state-of-the-art performance on the Flickr30K and MS-COCO benchmarks. Notably, it delivers a 3–5$\times$ computational speedup over pioneering approaches, offering compelling advantages for large-scale multi-modal tasks.
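To make the cohesive and discriminative feature optimization concrete, one plausible instantiation combines a pull term toward each group's center with a margin-based push term between centers. In the sketch below, $\mathbf{v}_i^g$ denotes the $i$-th of $P$ normalized patch features assigned to semantic group $g$, $\mathbf{c}_g$ is the center of group $g$ among $G$ groups, and $m$ is a margin; the symbols, the cosine similarity, and the hinge form are illustrative assumptions rather than the paper's exact formulation:
$$
\mathcal{L}_{\text{CD}} \;=\; \underbrace{\frac{1}{GP}\sum_{g=1}^{G}\sum_{i=1}^{P}\Bigl(1 - \cos\!\bigl(\mathbf{v}_i^g,\, \mathbf{c}_g\bigr)\Bigr)}_{\text{intra-semantic cohesion}} \;+\; \underbrace{\frac{1}{G(G-1)}\sum_{g \neq g'} \max\!\Bigl(0,\; \cos\!\bigl(\mathbf{c}_g,\, \mathbf{c}_{g'}\bigr) - m\Bigr)}_{\text{inter-semantic discriminability}}
$$
The first term tightens each semantic region by reducing the cosine distance of its patches to the group center, while the second keeps distinct region centers separated beyond the margin $m$, matching the cohesion and discriminability goals stated above.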