POGA: Paraphrased and Oppositional Graph Alignment for Fine-Grained Cross-Modal Retrieval
Abstract
Most models used to generate embeddings for retrieval are not trained specifically for that task, which leads them to focus on coarse semantic alignment rather than on particular object attributes or arrangements. This limits their performance, particularly on challenging problems such as fine-grained cross-modal retrieval. Furthermore, their training objectives lack the discriminative ability required to distinguish between descriptions that are semantically similar but factually different. To address these challenges, we propose POGA (Paraphrased and Oppositional Graph Alignment), a novel framework for fine-grained cross-modal alignment. POGA comprises two core innovations: (1) Multi-source Graph Augmentation (MSGA), which not only generates paraphrased positives and oppositional negatives but also parses the image and all text variants into structured graphs to provide difference-rich supervisory signals; and (2) Hybrid Multi-granularity Alignment (HMA), which defines a composite training objective that jointly optimizes the model at four distinct granularities: robust dual global alignment, together with precise matching at three fine-grained levels, namely node, relation, and focal disproving. Experiments on benchmarks such as DCI and DOCCI demonstrate that POGA performs favorably against several state-of-the-art methods in long-text understanding and complex relation discrimination.
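As a concrete reading of the HMA objective, it can be sketched as a weighted sum of one global term and three fine-grained terms; the individual loss symbols and the weights $\lambda_1, \lambda_2, \lambda_3$ below are illustrative assumptions rather than notation taken from the paper:

$$\mathcal{L}_{\mathrm{HMA}} = \mathcal{L}_{\mathrm{global}} + \lambda_1 \mathcal{L}_{\mathrm{node}} + \lambda_2 \mathcal{L}_{\mathrm{rel}} + \lambda_3 \mathcal{L}_{\mathrm{focal}}$$

Here $\mathcal{L}_{\mathrm{global}}$ would stand for the dual global alignment term, while the remaining terms correspond to node-level matching, relation-level matching, and focal disproving, respectively.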