

Poster

OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

Linke Ouyang · Yuan Qu · Hongbin Zhou · Jiawei Zhu · Rui Zhang · Qunshu Lin · Bin Wang · Zhiyuan Zhao · Man Jiang · Xiaomeng Zhao · Jin Shi · Fan Wu · Pei Chu · Minghao Liu · Zhenxiang Li · Chao Xu · Bo Zhang · Botian Shi · Zhongying Tu · Conghui He

ExHall D Poster #364
Sun 15 Jun 8:30 a.m. PDT — 10:30 a.m. PDT

Abstract:

Document content extraction is crucial in computer vision, especially for meeting the high-quality data needs of large language models (LLMs) and retrieval-augmented generation (RAG) technologies. However, current document parsing methods suffer from significant limitations in diversity and comprehensive evaluation. To address these challenges, we introduce OmniDocBench, a novel multi-source benchmark designed to advance automated document content extraction. OmniDocBench includes a meticulously curated and annotated high-quality evaluation dataset comprising nine diverse document types, including academic papers, textbooks, and slides. Our benchmark provides a flexible and comprehensive evaluation framework with 19 layout category labels and 14 attribute labels, enabling multi-level assessments across entire datasets, individual modules, or specific data types. Using OmniDocBench, we perform an exhaustive comparative analysis of existing modular pipelines and multimodal end-to-end methods, highlighting their limitations in handling document diversity and in ensuring fair evaluation. OmniDocBench establishes a robust, diverse, and fair evaluation standard for document content extraction, offering crucial insights for future advancements and fostering the development of document parsing technologies.
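As a rough illustration of the multi-level evaluation described above (whole dataset, individual modules, or specific data types), the Python sketch below averages hypothetical per-page parsing scores grouped by document type, by attribute, or over the full set. The PageResult structure, attribute names, and metric fields are assumptions for illustration only, not OmniDocBench's actual schema or API.

    from collections import defaultdict
    from dataclasses import dataclass, field

    # Hypothetical per-page record; the benchmark's real schema and metric names may differ.
    @dataclass
    class PageResult:
        doc_type: str                                       # e.g. "academic_paper", "textbook", "slides"
        attributes: dict = field(default_factory=dict)      # attribute labels, e.g. {"language": "en"}
        module_scores: dict = field(default_factory=dict)   # per-module metrics, e.g. {"text": 0.91, "table": 0.74}

    def aggregate(results, group_key):
        """Average each module's score within the groups defined by group_key(result)."""
        sums = defaultdict(lambda: defaultdict(float))
        counts = defaultdict(lambda: defaultdict(int))
        for r in results:
            g = group_key(r)
            for module, score in r.module_scores.items():
                sums[g][module] += score
                counts[g][module] += 1
        return {g: {m: sums[g][m] / counts[g][m] for m in sums[g]} for g in sums}

    # Same per-page results viewed at three levels: dataset, document type, and attribute.
    results = [
        PageResult("academic_paper", {"language": "en"}, {"text": 0.92, "formula": 0.81}),
        PageResult("textbook", {"language": "zh"}, {"text": 0.88, "table": 0.70}),
    ]
    overall = aggregate(results, lambda r: "all")                                      # dataset-level
    by_type = aggregate(results, lambda r: r.doc_type)                                 # data-type-level
    by_lang = aggregate(results, lambda r: r.attributes.get("language", "unknown"))    # attribute-level
    print(overall, by_type, by_lang, sep="\n")

The point of the sketch is the grouping key: the same per-page scores can be re-aggregated along any layout or attribute label without recomputing the underlying metrics.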
