Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding
Abstract
Document understanding is a long-standing practical task. Vision-Language Models (VLMs) have become a primary approach in this domain, demonstrating strong performance on single-page tasks. However, their effectiveness diminishes on long documents, where clues are scattered across multiple pages and modalities and the redundancy of lengthy inputs can impair the model's judgment. While retrieval-augmented generation mitigates this issue by filtering for question-relevant content, the retrieved results still contain substantial redundancy. To address these limitations, we propose SLEUTH, a multi-agent framework that orchestrates a retriever and four collaborative agents in a coarse-to-fine process. The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy; it ultimately synthesizes a distilled, evidence-dense multimodal context from which the final prediction is generated. SLEUTH is model-agnostic and scalable: when paired with advanced VLM backbones, it consistently improves performance on multiple long-document benchmarks, achieving state-of-the-art results. Ablation studies verify each module's effectiveness and confirm the benefits of our hierarchical refinement paradigm.
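The coarse-to-fine flow described above can be sketched as a simple pipeline. This is a minimal illustrative sketch only: the function and agent names (`retriever`, `clue_agent`, `evidence_agent`, `strategy_agent`, `synthesis_agent`) are assumptions, not the paper's actual interfaces.

```python
# Hypothetical sketch of SLEUTH's coarse-to-fine process; all names are
# illustrative assumptions, since the abstract does not specify interfaces.

def sleuth_answer(query, document_pages, retriever, clue_agent,
                  evidence_agent, strategy_agent, synthesis_agent, vlm):
    # Coarse stage: retrieve question-relevant pages from the long document.
    pages = retriever(query, document_pages)
    # Agent 1: identify key textual and visual clues within retrieved pages.
    clues = clue_agent(query, pages)
    # Agent 2: filter for salient visual evidence such as tables and charts.
    evidence = evidence_agent(query, pages, clues)
    # Agent 3: analyze the query to devise a reasoning strategy.
    strategy = strategy_agent(query, clues)
    # Agent 4 (fine stage): synthesize a distilled, evidence-dense
    # multimodal context from the gathered clues and evidence.
    context = synthesis_agent(query, clues, evidence, strategy)
    # The backbone VLM generates the final prediction from that context.
    return vlm(query, context)
```

Because each stage is an opaque callable, the pipeline stays model-agnostic: any VLM backbone can be dropped in as `vlm` without changing the orchestration.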