Paper in Workshop: 8th Multimodal Learning and Applications Workshop

MVCM: Enhancing Multi-View and Cross-Modality Alignment for Medical Visual Question Answering and Medical Image-Text Retrieval

Yuanhao Zou · Zhaozheng Yin


Abstract:

Recent advancements in medical vision-language tasks, such as Medical Visual Question Answering (Med-VQA) and Medical Image-Text Retrieval (Med-ITR), aim to jointly learn from images and texts. However, two main issues persist in the field: the neglect of multi-view medical images and incomplete cross-modality understanding. Current studies often treat each image-text pair as an independent instance (i.e., at the instance level), neglecting the comprehensive contextual information available from multi-view images of the same study. Although some methods have explored refined alignments that combine alignment of global representations with token-wise alignment of local representations, they often rely on only a uni-modality encoder (e.g., a visual encoder) for downstream applications, lacking comprehensive cross-modality understanding. To address these issues, this paper introduces MVCM, a framework that supports Multi-View and Cross-Modality alignment for Med-VQA and Med-ITR tasks. Our proposed method fully utilizes the multi-view images in radiology datasets and aligns them at the study level. We also employ various pretext tasks to support cross-modality alignment. We fine-tune the proposed model on the downstream Med-VQA and Med-ITR tasks, outperforming state-of-the-art methods across multiple datasets. The code will be publicly available.
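To make the study-level idea concrete, below is a minimal PyTorch sketch of aligning a pooled multi-view image embedding with its report embedding via a symmetric InfoNCE-style contrastive loss. The masked-mean pooling over views, the temperature value, and the function name study_level_contrastive_loss are illustrative assumptions, not the authors' released implementation or the paper's full set of pretext tasks.

import torch
import torch.nn.functional as F

def study_level_contrastive_loss(view_feats: torch.Tensor,
                                 view_mask: torch.Tensor,
                                 text_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Align each study's pooled multi-view image embedding with its report.

    view_feats: (B, V, D) per-view embeddings from an image encoder,
                zero-padded to V views per study
    view_mask:  (B, V)    1 for real views, 0 for padded slots
    text_emb:   (B, D)    report embeddings from a text encoder
    """
    mask = view_mask.unsqueeze(-1).float()
    # Masked mean over views -> one study-level image embedding per study.
    study_emb = (view_feats * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

    img = F.normalize(study_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    logits = img @ txt.t() / temperature  # (B, B) study-to-report similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric InfoNCE: image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

In this sketch, all views belonging to one study contribute to a single image-side embedding before the image-text matching step, which is the sense in which alignment happens at the study level rather than per image-text pair.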
