MCHDoc: A Comprehensive Benchmark for Reading Multi-Carrier Chinese Historical Documents
Abstract
Chinese historical documents are essential carriers for the inheritance and dissemination of traditional Chinese culture. However, traditional manual digitization of different types of historical carriers is not only time-consuming and labor-intensive but also heavily reliant on experts with specialized knowledge of the specific carrier domains. In the past, experts read Chinese historical documents by first recognizing the text and then consulting a large number of specialized reference books for citation and correction. With the emergence of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), we see new opportunities for reading different types of carriers in a unified way. Nevertheless, existing studies mainly focus on evaluating the OCR capabilities of MLLMs, without incorporating citation or retrieval functionalities, and are restricted to a single type of carrier. To address this, we introduce MCHDoc, a comprehensive benchmark for reading multi-carrier Chinese historical documents. This benchmark consists of 15,723 documents and covers six types of carriers: Inscription, AncientBook, Calligraphy, Oracle Bone, Silk, and JianDu (bamboo slips). Based on this benchmark, we evaluate various MLLMs and LLMs to test their capacity for reading multi-carrier Chinese historical documents. The results reveal that the top MLLMs and LLMs achieve excellent performance on some types of carriers, but there is still room for improvement before they can read all carrier types reliably. Overall, MCHDoc is a standardized and comprehensive benchmark for reading Chinese historical documents, providing valuable insights for the study of Chinese culture.