MR-RAG: Multimodal Relevance-Aware Retrieval-Augmented Generation for Medical Visual Question Answering
Abstract
Large Vision-Language Models (LVLMs) equipped with retrieval-augmented generation (RAG) are emerging as a leading paradigm for medical vision-language tasks. However, existing approaches exhibit significant limitations in both key stages: retrieval and generation. First, during retrieval, most methods rely on a single similarity signal to estimate document relevance, ignoring the rich information available in multimodal data and often failing to retrieve well-matched content. Second, during generation, retrieved documents are fed directly and uniformly into the LVLM input without regard for their varying relevance to the question, which can dilute crucial information and amplify the negative impact of irrelevant content. To address these limitations, we propose MR-RAG, a dual-stage RAG enhancement framework that incorporates multimodal relevance into both the retrieval and generation phases. Specifically, we first introduce a Multimodal Cooperative Retrieval (MCR) module that leverages both intra-modal and cross-modal signals to jointly retrieve semantically aligned documents. We then design an Importance-Aware Information Flow Augmentation (IFA) mechanism that augments attention paths according to the fused multimodal relevance, enabling more precise control over the information flow during answer generation. By coherently bridging retrieval and generation through multimodal signals, our method significantly improves factual accuracy and robustness. Experiments on three medical datasets demonstrate that our method outperforms state-of-the-art baselines, achieving up to a 6.4% improvement in accuracy.
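For intuition, the sketch below shows one plausible form of the retrieval-side fusion described above: intra-modal (image-image, text-text) and cross-modal (image-text) cosine similarities are combined into a single relevance score used to rank candidate documents. The shared CLIP-style embedding space, the function names, and the fusion weights are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def cosine_sim(query: np.ndarray, docs: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector (d,) and N document vectors (N, d)."""
    q = query / (np.linalg.norm(query) + 1e-8)
    d = docs / (np.linalg.norm(docs, axis=1, keepdims=True) + 1e-8)
    return d @ q  # shape (N,)

def fused_relevance(q_img, q_txt, d_img, d_txt, w_intra=0.6, w_cross=0.4):
    """Combine intra-modal (image-image, text-text) and cross-modal
    (image-text, text-image) similarities into one relevance score per document.
    The fusion weights w_intra / w_cross are hypothetical, not from the paper."""
    intra = 0.5 * (cosine_sim(q_img, d_img) + cosine_sim(q_txt, d_txt))
    cross = 0.5 * (cosine_sim(q_img, d_txt) + cosine_sim(q_txt, d_img))
    return w_intra * intra + w_cross * cross

# Toy usage: rank 4 candidate documents in a shared 8-dim embedding space.
rng = np.random.default_rng(0)
q_img, q_txt = rng.normal(size=8), rng.normal(size=8)
d_img, d_txt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
scores = fused_relevance(q_img, q_txt, d_img, d_txt)
top_k = np.argsort(scores)[::-1][:2]  # indices of the 2 most relevant documents
print(scores, top_k)
```

In such a scheme, the same fused scores could then be reused at generation time, e.g. to weight attention toward higher-relevance documents, which is the role the abstract assigns to the IFA mechanism.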