DREAM: Document Recognition with Explicit Adaptive Memory
Abstract
Large multimodal models (LMMs) have shown promising performance on various document recognition tasks. However, LMMs rely on implicit modeling, and their parameters lack interpretability. Inspired by recent advances in human memory and learning research, we propose an explicit multiscale prototype memory that augments document recognition models by explicitly modeling recurrent layout and stylistic patterns (e.g., image borders, tilted text) across different spatial resolutions. A memory retrieval mechanism enables local regions to sparsely attend to a few prototypes; the retrieved compositional factors are concatenated with visual features and passed to the decoder, providing explicit region-wise structural context. Prototype memory consolidation updates and stabilizes prototypes via an attention-weighted exponential moving average (EMA) strategy, while sparsity and anti-collapse regularization promote selective activation and disentanglement. We further adopt a hierarchical memory and a scale-adaptive attention module for multi-resolution encoding, trained with a multi-task, entropy-regularized objective. We validate the method on two tasks: document recognition on the Fox benchmark and the self-built DreamDoc dataset, and handwriting recognition on the SCUT-HCCDoc and SCUT-EPT Chinese handwriting datasets. Experimental results demonstrate the effectiveness of the proposed method.
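To make the retrieval-and-consolidation idea concrete, the following is a minimal PyTorch sketch of a sparse prototype memory with an attention-weighted EMA update. The class name, feature dimensions, top-k value, and decay rate are illustrative assumptions, not the paper's actual implementation or hyper-parameters.

```python
# Minimal sketch of an explicit prototype memory (hypothetical names/sizes).
import torch
import torch.nn.functional as F

class PrototypeMemory(torch.nn.Module):
    """Explicit prototype bank with sparse retrieval and EMA consolidation."""
    def __init__(self, num_prototypes=64, dim=256, top_k=4, ema_decay=0.99):
        super().__init__()
        self.top_k = top_k
        self.ema_decay = ema_decay
        # Prototypes live in a buffer: updated by EMA, not by gradients.
        self.register_buffer("prototypes", torch.randn(num_prototypes, dim))

    def forward(self, region_feats):
        # region_feats: (B, R, D) local visual features for R regions.
        sim = region_feats @ self.prototypes.t()                  # (B, R, P)
        # Sparse attention: each region attends to only its top-k prototypes.
        topk_val, topk_idx = sim.topk(self.top_k, dim=-1)         # (B, R, k)
        attn = F.softmax(topk_val, dim=-1)                        # (B, R, k)
        retrieved = torch.einsum(
            "brk,brkd->brd", attn, self.prototypes[topk_idx])     # (B, R, D)
        # Concatenate retrieved structural context with visual features
        # before handing them to the decoder.
        fused = torch.cat([region_feats, retrieved], dim=-1)      # (B, R, 2D)
        return fused, topk_idx, attn

    @torch.no_grad()
    def consolidate(self, region_feats, topk_idx, attn):
        # Attention-weighted EMA update of the selected prototypes.
        B, R, k = topk_idx.shape
        D = region_feats.size(-1)
        flat_idx = topk_idx.reshape(-1)                           # (B*R*k,)
        flat_w = attn.reshape(-1, 1)                              # (B*R*k, 1)
        flat_x = region_feats.unsqueeze(2).expand(-1, -1, k, -1).reshape(-1, D)
        # Accumulate attention-weighted features and weights per prototype.
        num = torch.zeros_like(self.prototypes)
        den = torch.zeros(self.prototypes.size(0), 1, device=num.device)
        num.index_add_(0, flat_idx, flat_w * flat_x)
        den.index_add_(0, flat_idx, flat_w)
        target = num / den.clamp_min(1e-6)
        updated = den.squeeze(-1) > 0
        self.prototypes[updated] = (
            self.ema_decay * self.prototypes[updated]
            + (1 - self.ema_decay) * target[updated]
        )
```

In this sketch the EMA consolidation is applied only to prototypes that received non-zero attention in the batch, which is one way to keep rarely used prototypes stable while frequently retrieved ones track the data; the sparsity and anti-collapse regularizers mentioned in the abstract would be added as separate loss terms and are omitted here.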