Poster Sun, Jun 7, 2026 • 2:30 PM – 4:30 PM PDT ExHall A 459

VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA

Young Rok Jang ⋅ Hyesoo Kong ⋅ Kyunghwan An ⋅ Jae Sub Huh ⋅ Gyeonghun KIM ⋅ Stanley Jungkyu Choi


Abstract

Real-world documents combine text with tables, charts, photographs, and diagrams arranged in diverse layouts, yet existing research on multimodal large language models (MLLMs) for document QA predominantly produces text-only responses, underutilizing these visual elements. We introduce VinQA, a dataset designed for long-form answer generation where cited visual elements are explicitly interleaved with their supporting text and grounded in relevant document pages. To support this task, we study two encoding methods for feeding raw document page images into an MLLM, along with their visual-element citation mechanisms: (1) Page Encoding, which directly encodes full-page images with bounding boxes of visual elements and treats these boxed regions as citable units; and (2) Modality Encoding, which parses each page to extract text and crop visual elements, encodes them separately, and uses these cropped elements as citable units. To evaluate such answers, we propose M-GroSE, a multimodal evaluation framework that extends GroUSE and assesses answers along four dimensions: completeness, answer relevancy, faithfulness, and unanswerability. We additionally report Visual Source F1 to directly measure visual citation accuracy. Although proprietary frontier models still achieve the best overall scores on the VinQA test split, fine-tuning open Qwen2.5-VL models on the VinQA training split substantially improves their performance and markedly narrows this gap. Before fine-tuning, Modality Encoding is more robust than Page Encoding for complex documents with long text, many visual elements, and diverse visual citation requirements. After training on VinQA, however, Page Encoding reaches comparable performance, showing that it can compete effectively even without the explicit parsing used in Modality Encoding. Finally, Visual G-Eval, an MLLM-based judge, confirms that fine-tuned models insert visual elements at semantically appropriate positions with faithful supporting text.
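The abstract names Visual Source F1 but does not spell out its formula. As a rough illustration only, the sketch below assumes it is a set-based F1 between the visual elements an answer cites and the gold set of elements it should cite. The function name `visual_source_f1`, the string element IDs, and the handling of the empty case are all assumptions made here for illustration, not details taken from the paper.

```python
from typing import Iterable


def visual_source_f1(cited: Iterable[str], gold: Iterable[str]) -> float:
    """Set-based F1 between cited and gold visual-element IDs (hypothetical definition)."""
    cited_set, gold_set = set(cited), set(gold)

    # Assumption: when no citation is expected and none is produced,
    # the answer is scored as perfect rather than undefined.
    if not cited_set and not gold_set:
        return 1.0

    true_positives = len(cited_set & gold_set)
    if true_positives == 0:
        return 0.0

    precision = true_positives / len(cited_set)  # share of cited elements that are correct
    recall = true_positives / len(gold_set)      # share of gold elements that were cited
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the answer cites one correct and one incorrect element.
score = visual_source_f1(["page3_fig1", "page5_table2"], ["page3_fig1", "page4_chart1"])
```

In this example one of the two cited elements is correct and one of the two gold elements is recovered, so precision and recall are both 0.5. Under Page Encoding the IDs would stand for boxed regions on a page, and under Modality Encoding for cropped elements; the sketch treats both simply as opaque identifiers.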