AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization
Abstract
Document generation has emerged as a crucial task for automating the creation of visually appealing, well-structured content across diverse domains. Existing methods, however, are limited in application scope, document representation, and dataset coverage, which greatly restricts the capabilities of document generation models. To address these challenges, we propose AnyDoc, a framework that introduces HTML/CSS as a novel document representation, given its inherent advantages in modeling hierarchical structure. Leveraging HTML/CSS, AnyDoc establishes a scalable data synthesis pipeline to curate DocHTML, a large-scale document dataset containing 265,206 high-quality samples. Each document in DocHTML includes complete metadata annotations, structured HTML/CSS source code, synthesized visual assets, and rendered screenshots, and the dataset spans diverse categories, styles, and complexity levels to ensure comprehensive coverage. AnyDoc then uses DocHTML to fine-tune multimodal large language models, equipping them with strong document generation capabilities on three practical tasks: intention-to-document, document derendering, and element-to-document. To address the content overflow observed in the fine-tuned models, we incorporate into AnyDoc a height-aware post-training method based on Group Relative Policy Optimization (GRPO). By designing the reward function to measure the alignment between predicted and target document heights, AnyDoc effectively alleviates the overflow problem and further improves model performance. Qualitative and quantitative results show that AnyDoc outperforms baseline models on all three tasks. Extensive ablation studies confirm the effectiveness of the HTML/CSS representation, the curated dataset, and the height-aware reinforcement optimization.
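To make the height-aware reward idea concrete, the following is a minimal illustrative sketch rather than the method used in this work. The abstract does not specify the reward's functional form, so `height_reward` (a relative-error reward clipped to [0, 1]) is an assumption, and both function names are hypothetical. The group normalization step follows the standard Group Relative Policy Optimization recipe of standardizing each reward against the other samples drawn for the same prompt.

```python
from statistics import mean, pstdev


def height_reward(pred_height: float, target_height: float) -> float:
    """Hypothetical reward: 1.0 when the rendered height matches the target,
    decreasing linearly with relative error and clipped at 0.0.
    The exact form is an assumption, not taken from the paper."""
    if target_height <= 0:
        raise ValueError("target_height must be positive")
    rel_err = abs(pred_height - target_height) / target_height
    return max(0.0, 1.0 - rel_err)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one sampled group (GRPO-style):
    advantage_i = (r_i - mean(r)) / (std(r) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: three sampled documents for one prompt, target height 1000 px.
# Heights here are made-up illustrative values.
pred_heights = [1000.0, 1500.0, 3000.0]
rewards = [height_reward(h, 1000.0) for h in pred_heights]
advantages = group_relative_advantages(rewards)
```

In this sketch, a sample whose rendered height overflows the target receives a lower reward than its group peers and therefore a negative advantage, which is the signal that would discourage overflow during post-training.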