Curriculum Group Policy Optimization: Adaptive Sampling for Unleashing the Potential of Text-to-Image Generation
Abstract
Text-to-Image (T2I) generation has achieved remarkable progress in recent years. Concurrently, reinforcement learning methods, particularly those based on Group Relative Policy Optimization (GRPO), have attracted widespread attention and have been successfully applied to T2I tasks. However, the uniform sampling strategy commonly adopted during training ignores whether sample difficulty matches the model's current learning capability, resulting in low training efficiency. We argue that the key to unleashing the model's potential lies in continuously supplying ``high-value samples'' that match its evolving competence. To this end, we propose Curriculum Group Policy Optimization (CGPO), an adaptive curriculum training framework. During training, each prompt is used to generate a group of images, and a reward model assigns a reward to each image. We use the variance of these rewards as a proxy for sample value: higher variance implies that the model's understanding of the prompt is still unstable, indicating stronger learnability and thus higher training value. CGPO adaptively constructs the curriculum by identifying and selecting high-value samples for training according to their reward variance. Additionally, to address data imbalance in multi-category datasets, we design a category calibration method based on proportional fairness optimization, which balances training difficulty across categories. Experiments on GenEval, T2I-CompBench++, and DPG-Bench demonstrate that our framework effectively improves generation performance.
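As a rough illustration of the variance-based selection described above, the following sketch ranks prompts by the variance of their group rewards and keeps the highest-variance ones for the next training round. The `policy.generate` and `reward_model.score` interfaces, as well as the `group_size` and `top_k` values, are hypothetical placeholders for illustration, not the paper's actual API.

```python
import numpy as np

def select_high_value_prompts(prompts, policy, reward_model,
                              group_size=8, top_k=256):
    """Rank prompts by reward variance and keep the most learnable ones.

    High variance over a group of sampled images suggests the policy's
    behaviour on that prompt is still unstable, i.e. the prompt carries
    a strong learning signal (a sketch of CGPO's selection criterion).
    """
    scores = []
    for prompt in prompts:
        # Sample a group of images per prompt (GRPO-style grouping);
        # policy.generate is an assumed interface.
        images = [policy.generate(prompt) for _ in range(group_size)]
        # reward_model.score is likewise an assumed interface.
        rewards = np.array([reward_model.score(prompt, img) for img in images])
        # Reward variance serves as the proxy for sample value.
        scores.append(rewards.var())
    # Keep the top-k highest-variance prompts for training.
    order = np.argsort(scores)[::-1]
    return [prompts[i] for i in order[:top_k]]
```

In practice this selection would be re-run as the policy updates, so the curriculum tracks the model's evolving competence rather than being fixed in advance.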