PromptEnhancer: Taming Your Rewriter for Text-to-Image Generation via Fine-Grained Reward
Abstract
Recent advances in text-to-image (T2I) diffusion models have demonstrated remarkable capabilities in generating high-fidelity images. However, these models often struggle to faithfully render complex user prompts, particularly in aspects such as attribute binding, negation, and compositional relationships. To address this challenge, we introduce PromptEnhancer, a novel and universal prompt-rewriting framework that enhances any pre-trained T2I model. Specifically, we adopt a multi-stage training pipeline to systematically improve the rewriter's understanding and rewriting performance. In the first stage, we conduct supervised fine-tuning (SFT) on chain-of-thought (CoT) data so that the rewriter learns to generate structured, CoT-style responses. In the second stage, we design a task-specific reward model, AlignEvaluator, which provides fine-grained preference signals, and further optimize the rewriter via GRPO. AlignEvaluator is trained to give explicit, fine-grained feedback based on a systematic taxonomy derived from common T2I failure cases. By optimizing the rewriter to maximize the reward from AlignEvaluator, our framework learns to generate prompts that T2I models can interpret more precisely. Furthermore, we introduce a comprehensive human-aligned benchmark to facilitate future research in this direction. Extensive experiments demonstrate that PromptEnhancer significantly improves image-text alignment across a wide range of semantic and compositional challenges.
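The second-stage objective can be sketched as a group-relative policy update: sample a group of candidate rewrites per prompt, score each with a reward model, and weight the policy gradient by the reward's deviation from the group mean. The sketch below is a minimal toy illustration, not the paper's implementation: the policy is a softmax over a handful of rewrite templates, and a hand-coded scorer stands in for AlignEvaluator (both are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

def grpo_step(logits, reward_fn, group_size=8, lr=0.5):
    """One GRPO-style update for a toy categorical 'rewriter' policy.

    logits: unnormalized scores over K candidate rewrite templates.
    reward_fn: maps a chosen template index to a scalar reward
               (a hand-coded stand-in for the AlignEvaluator reward model).
    """
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Sample a group of rollouts from the current policy.
    samples = rng.choice(len(logits), size=group_size, p=probs)
    rewards = np.array([reward_fn(s) for s in samples], dtype=float)
    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # REINFORCE-style gradient of the softmax log-likelihood, weighted by advantage.
    grad = np.zeros_like(logits)
    for s, a in zip(samples, adv):
        onehot = np.eye(len(logits))[s]
        grad += a * (onehot - probs)
    return logits + lr * grad / group_size

# Toy setup: template 2 yields the best image-text alignment reward.
reward = lambda k: [0.1, 0.3, 1.0, 0.2][k]
logits = np.zeros(4)
for _ in range(200):
    logits = grpo_step(logits, reward)
print(int(np.argmax(logits)))  # the policy should come to favor template 2
```

Normalizing rewards within each sampled group (rather than against a learned value baseline) is the characteristic GRPO design choice; in the full system the categorical policy would be a language-model rewriter and the scalar reward would come from AlignEvaluator.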