Do Less, Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models?
Abstract
Diffusion models have achieved outstanding success in image generation, yet their training objectives are typically limited to reconstruction, making it difficult to align generations directly with human preferences. Reinforcement learning (RL) offers a promising remedy by optimizing models against explicit reward signals. However, most studies apply RL across the entire denoising process, which is computationally expensive and tends to weaken preference alignment, i.e., doing more but achieving less. We observe that the impact of RL fine-tuning varies significantly across denoising stages. In the early stage, image structures are unstable and distant from the final reward signal; applying RL here leads to delayed rewards and action-reward mismatch, resulting in high-variance, inefficient updates. Conversely, in the later stage, reward gains saturate, and continued training tends to overfit local details, intensifying reward hacking. To tackle these challenges, we propose \ourmethod{}, an RL-enhanced plug-in that improves generation quality while reducing computational cost. Specifically, \ourmethod{} adaptively identifies the optimal timing for RL by monitoring structural evolution and semantic consistency during denoising, and dynamically terminates training once denoising converges and reward gains saturate. As a result, it achieves a rare `dual benefit': lower computational cost alongside a significant performance improvement. Theoretical analysis from an entropy perspective and extensive experiments support our claims: compared with state-of-the-art methods, \ourmethod{} improves performance by xx\% while cutting computational cost by xx\%.
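To make the start/stop idea concrete, the sketch below is a minimal illustration of an adaptive RL window over denoising steps, not the paper's actual criteria: RL updates begin only once successive steps stop reshaping the image structure, and end once per-step reward gains plateau. The helpers `structural_change` and `reward`, and all thresholds, are hypothetical placeholders supplied by the caller.

```python
# Minimal sketch (assumed interface, not \ourmethod{}'s algorithm): pick a
# window of denoising steps for RL updates based on structural stability
# and reward-gain saturation.

def select_rl_window(latents, structural_change, reward,
                     struct_tol=0.05, gain_tol=1e-3):
    """latents: intermediate denoising outputs, ordered early -> late.
    structural_change(a, b): hypothetical scalar measure of structural drift.
    reward(x): hypothetical scalar reward for an intermediate output.
    Returns (start, stop) step indices, or None if structure never stabilizes.
    """
    # Start RL only after consecutive steps stop changing the structure much.
    start = None
    for t in range(1, len(latents)):
        if structural_change(latents[t - 1], latents[t]) < struct_tol:
            start = t
            break
    if start is None:
        return None  # early-stage instability persists; skip RL for this sample

    # Terminate RL once the per-step reward gain saturates.
    stop = len(latents) - 1
    prev_r = reward(latents[start])
    for t in range(start + 1, len(latents)):
        r = reward(latents[t])
        if r - prev_r < gain_tol:
            stop = t
            break
        prev_r = r
    return start, stop
```

In this toy version, steps before `start` are left to the base sampler (avoiding delayed-reward, high-variance updates) and steps after `stop` are frozen (limiting reward hacking on local details); the actual structural and semantic signals used by \ourmethod{} are described in the method section.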