POGA: Paraphrased and Oppositional Graph Alignment for Fine-Grained Cross-Modal Retrieval
Abstract
Most models used to generate embeddings for retrieval are not trained specifically for that task, which leads them to focus on coarse semantic alignment rather than on particular object attributes or arrangements. This limits their performance, particularly on challenging problems such as fine-grained cross-modal retrieval. Furthermore, their training objectives lack the discriminative ability required to distinguish between descriptions that are semantically similar but factually different. To address these challenges, we propose POGA (Paraphrased and Oppositional Graph Alignment), a novel framework for fine-grained cross-modal alignment. POGA comprises two core innovations: (1) Multi-source Graph Augmentation (MSGA), which not only generates paraphrased positives and oppositional negatives but also parses the image and all text variants into structured graphs to provide difference-rich supervisory signals; and (2) Hybrid Multi-granularity Alignment (HMA), which defines a composite training objective that jointly optimizes the model at four distinct granularities: robust dual global alignment, together with precise matching at three fine-grained levels, namely node, relation, and focal disproving. Experiments on benchmarks such as DCI and DOCCI demonstrate that POGA performs favorably against several state-of-the-art methods in long-text understanding and complex relation discrimination.
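As a concrete reading of the HMA objective, it can be sketched as a weighted sum of one global term and three fine-grained terms; the individual loss symbols and the weights $\lambda_1, \lambda_2, \lambda_3$ below are illustrative assumptions rather than notation taken from the paper:

$$\mathcal{L}_{\mathrm{HMA}} = \mathcal{L}_{\mathrm{global}} + \lambda_1 \mathcal{L}_{\mathrm{node}} + \lambda_2 \mathcal{L}_{\mathrm{rel}} + \lambda_3 \mathcal{L}_{\mathrm{focal}}$$

Here $\mathcal{L}_{\mathrm{global}}$ would stand for the dual global alignment term, while the remaining terms correspond to node-level matching, relation-level matching, and focal disproving, respectively.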