MMDIR: Multimodal Instruction-Driven Framework for Mixed-Degradation Document Image Restoration
Heng Li ⋅ Xingyuan Wang ⋅ Yang Fan ⋅ Yunan Zhang ⋅ Xiangping Wu ⋅ Qingcai Chen
Abstract
Restoring degraded document images is essential both for improving visual quality and for optimizing performance in downstream document analysis tasks. Although existing methods have demonstrated substantial improvements in restoration quality, they primarily address single-type degradation scenarios: current approaches typically require training multiple specialized models, one per degradation type, or rely on explicit prior knowledge of degradation patterns to guide training. To overcome these limitations, we propose $\textit{MMDIR}$, a multimodal instruction-driven framework designed for document image restoration under mixed and uncertain degradation conditions. By leveraging semantically structured instructions, MMDIR dynamically identifies the degradation types present (blur, shadow, text watermark, and seal) while enhancing degradation-aware representation learning. Furthermore, we introduce a novel benchmark named $\textit{MixedDoc}$ comprising complex mixed degradations, where each image contains a randomized combination of the aforementioned types. This benchmark addresses a critical gap in existing datasets, which lack realistic multi-degradation samples and often overlook common obstructions such as seals and text watermarks. The effectiveness of our approach is thoroughly validated on both established public benchmarks and our newly proposed dataset.