Poster
Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark
Hao Guo · Xugong Qin · Jun Jie Ou Yang · Peng Zhang · Gangyan Zeng · Yubo Li · Hailun Lin
Document image retrieval (DIR) aims to retrieve document images from a gallery according to a given query. Existing DIR methods are primarily based on image queries that retrieve documents within the same coarse semantic category, e.g., newspapers or receipts. However, these methods struggle to retrieve document images effectively in real-world scenarios that require the fine-grained semantics of text queries. To bridge this gap, this paper introduces a new benchmark for Natural Language-based Document Image Retrieval (NL-DIR), along with corresponding evaluation metrics. In this work, natural language descriptions serve as semantically rich queries for the DIR task. The NL-DIR dataset contains 41K authentic document images, each paired with five high-quality, fine-grained semantic queries generated and evaluated by large language models in conjunction with manual verification. We propose a two-stage retrieval method for DIR that enhances retrieval performance while optimizing both time and space efficiency. Furthermore, we perform zero-shot and fine-tuning evaluations of existing contrastive vision-language models and OCR-free visual document understanding (VDU) models on this dataset. The dataset and code will be made publicly available to facilitate research in the VDU community.
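The abstract only names the two-stage retrieval design, so the following is a minimal sketch of the general pattern such a pipeline typically follows, not the authors' actual method: a cheap first stage shortlists candidates by embedding similarity, and a heavier (here hypothetical) scoring function re-ranks only that shortlist, which is where the time and space savings come from. The function names, the `rerank_score` callback, and all parameters are illustrative assumptions.

```python
import numpy as np

def two_stage_retrieve(query_vec, gallery_vecs, rerank_score, k=100, top=10):
    """Illustrative two-stage retrieval (not the paper's method).

    Stage 1: coarse shortlist by cosine similarity over precomputed
    gallery embeddings (one dot product per document image).
    Stage 2: an expensive, hypothetical scorer re-ranks only the
    top-k shortlist, so its cost does not scale with gallery size.
    """
    # Normalize so dot products equal cosine similarities.
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery_vecs / np.linalg.norm(gallery_vecs, axis=1, keepdims=True)

    sims = g @ q                       # stage 1: cheap similarity scan
    shortlist = np.argsort(-sims)[:k]  # keep only the k best candidates

    # Stage 2: run the heavy scorer on the shortlist alone.
    scored = [(i, rerank_score(i)) for i in shortlist]
    scored.sort(key=lambda pair: -pair[1])
    return [i for i, _ in scored[:top]]
```

In a real system the stage-1 embeddings would come from a contrastive vision-language model and stage 2 from a heavier cross-attention scorer; here any callable works as `rerank_score`.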