Poster
Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward
Zhiwei Jia · Yuesong Nan · Huixi Zhao · Gengdai Liu
Abstract:
Recent research has shown that fine-tuning diffusion models (DMs) with arbitrary rewards, including non-differentiable ones, is feasible with reinforcement learning (RL) techniques, offering great flexibility in model alignment. However, it is challenging to apply existing RL methods to timestep-distilled DMs for ultra-fast (2-step) image generation. Our analysis suggests several limitations of policy-based RL methods such as PPO or DPO for improving 2-step image generation. Based on these insights, we propose to fine-tune DMs with learned differentiable surrogate rewards. Our method, named LaSRO, learns surrogate reward models in the latent space of SDXL to convert arbitrary rewards into differentiable ones for efficient reward gradient guidance. LaSRO leverages pre-trained latent DMs for reward modeling and specifically targets 2-step image generation for reward optimization, enhancing generalizability and efficiency. We show that LaSRO is effective and stable for improving ultra-fast image generation under different reward objectives, outperforming popular RL methods including those based on PPO or DPO. We further show LaSRO's connection to value-based RL, providing theoretical insights behind it.
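The abstract describes the core loop at a high level: fit a differentiable surrogate reward on latents so that an arbitrary (possibly non-differentiable) reward can supply gradients to a few-step generator. Below is a minimal PyTorch sketch of that two-stage idea, not the authors' implementation; the module names, latent dimension, and the placeholder black-box reward are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Small latent size for illustration only; in practice this would be the
# (flattened) latent of a pre-trained latent DM such as SDXL.
LATENT_DIM = 64


class SurrogateReward(nn.Module):
    """Differentiable surrogate mapping a latent to a scalar reward estimate."""
    def __init__(self, dim=LATENT_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.SiLU(), nn.Linear(256, 1))

    def forward(self, z):
        return self.net(z).squeeze(-1)


class TwoStepGenerator(nn.Module):
    """Stand-in for a timestep-distilled (2-step) latent generator."""
    def __init__(self, dim=LATENT_DIM):
        super().__init__()
        self.step = nn.Linear(dim, dim)

    def forward(self, noise):
        z = torch.tanh(self.step(noise))   # generation step 1
        return torch.tanh(self.step(z))    # generation step 2


def black_box_reward(z):
    """Arbitrary, non-differentiable reward (e.g. a preference or aesthetic score)."""
    with torch.no_grad():
        return -z.pow(2).mean(dim=1)       # placeholder objective


generator, surrogate = TwoStepGenerator(), SurrogateReward()
opt_r = torch.optim.Adam(surrogate.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-5)

for _ in range(100):
    noise = torch.randn(8, LATENT_DIM)

    # Stage 1: fit the surrogate to the black-box reward on generated latents.
    with torch.no_grad():
        z = generator(noise)
    loss_r = nn.functional.mse_loss(surrogate(z), black_box_reward(z))
    opt_r.zero_grad(); loss_r.backward(); opt_r.step()

    # Stage 2: fine-tune the generator with the surrogate's reward gradient.
    z = generator(noise)
    loss_g = -surrogate(z).mean()          # maximize the predicted reward
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

In the paper the surrogate operates in the latent space of the pre-trained DM and the optimization targets the 2-step sampling path directly; the sketch only illustrates how a learned differentiable surrogate can replace policy-gradient updates with straightforward reward-gradient guidance.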