M3Grounder: Mask-Based Multi-Span and Multi-Granular Grounding for Document QA
Abstract
Document QA requires not only accurate answers but also identifying where each answer is grounded on the page. Most models treat the task as text-only generation, while existing answer-grounding methods generate coarse bounding boxes that fail to capture curved text. We introduce M3Grounder, a hybrid vision–language and segmentation architecture that formulates document grounding as pixel-level segmentation. It produces fine-grained evidence masks, refined by a bleed-suppression loss that prevents mask spillover. M3Grounder autoregressively generates answer text interleaved with [GROUND] tokens that link individual answer spans to their corresponding evidence regions. It also grounds evidence hierarchically across phrase, line, and block levels, using an enclosure loss that enforces spatial containment. We release the GroundingDocQA dataset (200K documents, 2M multi-span and multi-granular QA pairs with pixel-level grounding masks), built through a data engine that handles complex layouts, curved text, and graphics-rich documents. We also release GroundingDocQA-Bench, a diverse and challenging human-verified benchmark. M3Grounder sets a new state of the art in grounded DocVQA, advancing from coarse boxes to hierarchical, fine-grained, and contextually grounded evidence.
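To make the idea of spatial containment concrete, the following is a minimal sketch of one way a containment penalty over nested soft masks could be written. It is an illustrative assumption, not the enclosure loss as defined in this work: the function names `containment_penalty` and `enclosure_penalty`, the hinge form max(0, child - parent), and the toy masks are all hypothetical.

```python
from typing import List

Mask = List[List[float]]  # soft mask with per-pixel values in [0, 1]


def containment_penalty(child: Mask, parent: Mask) -> float:
    """Mean of max(0, child - parent) over all pixels.

    Hypothetical hinge-style penalty: it is zero when every child pixel
    is covered at least as strongly by the parent mask.
    """
    total = 0.0
    count = 0
    for child_row, parent_row in zip(child, parent):
        for c, p in zip(child_row, parent_row):
            total += max(0.0, c - p)
            count += 1
    return total / count if count else 0.0


def enclosure_penalty(phrase: Mask, line: Mask, block: Mask) -> float:
    """Assumed hierarchy: phrase inside line, line inside block."""
    return containment_penalty(phrase, line) + containment_penalty(line, block)


# Toy 2x4 masks (illustrative values only).
phrase = [[0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 0.0]]
line = [[1.0, 1.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 0.0]]
block = [[1.0, 1.0, 1.0, 1.0],
         [1.0, 1.0, 1.0, 1.0]]

# Properly nested masks incur no penalty.
nested = enclosure_penalty(phrase, line, block)

# A phrase mask with one pixel outside the line mask is penalized.
leaky_phrase = [[0.0, 1.0, 0.0, 1.0],
                [0.0, 0.0, 0.0, 0.0]]
leaky = enclosure_penalty(leaky_phrase, line, block)
```

Under these assumptions the penalty vanishes exactly when the phrase mask lies within the line mask and the line mask within the block mask, and it grows with the amount of mask mass that escapes its parent.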