Poster
Docopilot: Improving Multimodal Models for Document-Level Understanding
Yuchen Duan · Zhe Chen · Yusong Hu · Weiyun Wang · Shenglong Ye · Botian Shi · Lewei Lu · Qibin Hou · Tong Lu · Hongsheng Li · Jifeng Dai · Wenhai Wang
Despite significant progress in multimodal large language models (MLLMs), their performance on complex, multi-page document comprehension remains inadequate, largely due to the lack of high-quality, document-level datasets. While current retrieval-augmented generation (RAG) methods offer partial solutions, they suffer from issues such as fragmented retrieval contexts, multi-stage error accumulation, and the extra time cost of retrieval. In this work, we present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from original documents. Building on this dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG. Experiments demonstrate that Docopilot achieves superior coherence, accuracy, and efficiency in document understanding tasks and multi-turn interactions, setting a new baseline for document-level multimodal understanding. Data, code, and models will be released.