M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models
Abstract
In large-scale industrial documents with scanned images, complex layouts, and multiple pages, the effectiveness of retrieval-augmented generation (RAG) depends heavily on chunking quality. However, existing text-centric chunkers overlook the visual and structural cues present in real-world documents, producing redundant or ambiguous chunks that impair retrieval and answer accuracy. To address this problem, we propose \textbf{\ours}, which integrates (i) SharedDet for normalizing document parsing and OCR outputs into a document-level frame, (ii) multi-modal block embeddings with boundary-aware SoftROI, (iii) global document-tree reconstruction via biaffine scoring, and (iv) structure-aware dependency chunking that preserves boundaries and reduces redundancy. \ours achieves consistent gains across both Document Hierarchical Parsing (DHP) and corpus-level RAG evaluations, improving STEDS by +28.5--39.6\%, retrieval nDCG by +1.1--15.3\%, and QA ANLS by +4.5--15.3\%. These results demonstrate that modeling document-level dependencies with multi-modal, structure-aware chunking improves RAG performance on long, multi-page industrial documents.