

DIEM: Decomposition-Integration Enhancing Multimodal Insights

Xinyi Jiang · Guoming Wang · Junhao Guo · Juncheng Li · Wenqiao Zhang · Rongxing Lu · Siliang Tang

Arch 4A-E Poster #303
Fri 21 Jun 5 p.m. PDT — 6:30 p.m. PDT


In image question answering, inputs carry abundant and sometimes redundant information, so precisely matching and integrating the information in the text with that in the image is challenging. In this paper, we propose Decomposition-Integration Enhancing Multimodal Insights (DIEM), which first decomposes the given question and image into multiple subquestions and several sub-images, isolating specific elements for more focused analysis. We then integrate these sub-elements by matching each subquestion with its relevant sub-images while also retaining the original image, so that a comprehensive answer to the original question can be constructed without losing sight of the overall context. This strategy mirrors the human cognitive process of breaking a complex problem into smaller components, analyzing each individually, and then integrating the resulting insights. We implement DIEM on the LLaVA-v1.5 model and evaluate it on ScienceQA and MM-Vet. Experimental results indicate that our method improves accuracy in most question classes of ScienceQA (+2.03% on average), especially in the image modality (+3.40%). On MM-Vet, our method raises the score from 31.1 to 32.4. These findings highlight DIEM's effectiveness in handling the complexity of multimodal data and show that its decomposition-integration process can improve both the accuracy and the depth of image question answering.
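
The decomposition-integration flow described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the abstract does not say how questions or images are decomposed or how matching is scored, so `decompose_question`, `decompose_image`, `score`, and `answer` are hypothetical callables supplied by the caller, and `SubImage`, `match_sub_images`, and `diem_answer` are names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # hypothetical region format: (left, top, right, bottom)


@dataclass
class SubImage:
    """A cropped region of the original image (hypothetical structure)."""
    label: str
    box: Box


def match_sub_images(
    subquestion: str,
    sub_images: Sequence[SubImage],
    score: Callable[[str, SubImage], float],
    top_k: int = 1,
) -> List[SubImage]:
    """Return the top_k sub-images ranked by relevance to one subquestion."""
    ranked = sorted(sub_images, key=lambda s: score(subquestion, s), reverse=True)
    return ranked[:top_k]


def diem_answer(
    question: str,
    image: Any,
    decompose_question: Callable[[str], List[str]],
    decompose_image: Callable[[Any], List[SubImage]],
    score: Callable[[str, SubImage], float],
    answer: Callable[[str, Any, List[Tuple[str, List[SubImage]]]], Any],
    top_k: int = 1,
) -> Any:
    """Decompose, match, then integrate (all model calls are assumed callables)."""
    # Decomposition: split the question and the image into smaller parts.
    subquestions = decompose_question(question)
    sub_images = decompose_image(image)

    # Integration, step 1: pair each subquestion with its most relevant sub-images.
    matched = [
        (sq, match_sub_images(sq, sub_images, score, top_k))
        for sq in subquestions
    ]

    # Integration, step 2: answer the original question, passing the original
    # image alongside the matched pairs so the overall context is retained.
    return answer(question, image, matched)
```

In a real system, `answer` would presumably wrap a multimodal model such as LLaVA-v1.5, and `score` could be any text-image relevance measure; both choices are left open here because the abstract does not specify them.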
