MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture
Hui Li ⋅ Jiayue Lyu ⋅ Fu-Yun Wang ⋅ Kaihui Cheng ⋅ Siyu Zhu ⋅ Jingdong Wang
Abstract
This paper studies the training-testing discrepancy (a.k.a. exposure bias) problem for improving diffusion models. During training, the input to the prediction network at a given training timestep is the corresponding ground-truth noisy data, an interpolation of the noise and the data; during testing, the input is the generated noisy data. We present a novel training approach, named MixFlow, for improving the training of the prediction network. Our approach is motivated by the Slow Flow phenomenon: the ground-truth interpolation nearest to the generated noisy data at a given sampling timestep is observed to correspond to a higher-noise timestep (termed the slowed timestep), i.e., the corresponding ground-truth timestep lags behind the sampling timestep. MixFlow leverages the interpolations at the slowed timesteps, termed the slowed interpolation mixture, for post-training the prediction network at each training timestep. Experiments on class-conditional image generation (including SiT, REPA, and RAE) and text-to-image generation validate the effectiveness of our approach. Applied to RAE models, MixFlow achieves strong generation results on ImageNet: $1.43$ FID without guidance and $1.10$ FID with guidance at $256 \times 256$, and $1.55$ FID without guidance and $1.10$ FID with guidance at $512 \times 512$.
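To make the interpolation and the Slow Flow phenomenon concrete, the following is a minimal sketch under the standard linear-interpolation convention of flow matching, taking $t = 1$ as pure noise; the symbols $x_0$, $\epsilon$, $\hat{x}_t$, and $t_s$ are illustrative notation assumed here, not necessarily the paper's:

$$x_t = (1 - t)\, x_0 + t\, \epsilon, \qquad t_s = \arg\min_{t'} \left\| \hat{x}_t - x_{t'} \right\|_2 \quad \text{with } t_s > t,$$

where $x_0$ is a data sample, $\epsilon$ is the noise, $\hat{x}_t$ is the noisy data generated at sampling timestep $t$, and $t_s$ is the slowed timestep whose ground-truth interpolation lies nearest to $\hat{x}_t$. MixFlow then post-trains the prediction network at each training timestep on a mixture of such slowed interpolations $x_{t_s}$.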