TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering
Abstract
Visual Text Rendering (VTR) remains a critical challenge in text‑to‑image generation, where even advanced models frequently produce text with structural anomalies such as distortion, blurriness, and misalignment. We find that leading MLLMs and specialist OCR models largely fail to perceive these structural anomalies, creating a critical bottleneck for both VTR evaluation and Reinforcement Learning (RL)‑based optimization: current evaluators and reward models lack fine‑grained structural perception. As a result, even state‑of‑the‑art generators (e.g., SeedDream4.0, Qwen‑Image) still struggle to render structurally faithful text. To address this, we propose TextPecker, a plug‑and‑play, structural‑anomaly‑perceptive RL strategy that mitigates noisy reward signals and works with any text‑to‑image generator. To enable this capability, we construct a recognition dataset with character‑level structural‑anomaly annotations and develop a stroke‑editing synthesis engine to expand structural‑error coverage. Experiments show that TextPecker consistently improves diverse text‑to‑image models; even on the well‑optimized Qwen‑Image, it yields average gains of 4\% in structural fidelity and 8.7\% in semantic alignment for Chinese text rendering, establishing a new state of the art in high‑fidelity VTR. Our work fills a gap in VTR optimization, providing a foundational step toward reliable and structurally faithful visual text generation.