Guiding a Diffusion Transformer with Its Own Internal Dynamics
Abstract
Diffusion models have a powerful ability to capture the entire (conditional) data distribution. However, given insufficient training and data, the model is penalized for failing to cover low-probability regions. To improve generation quality, guidance strategies such as classifier-free guidance (CFG) can steer samples toward high-probability regions during sampling. However, standard CFG often produces over-simplified or distorted samples, and the alternative line of guiding a diffusion model with a degraded version of itself relies on carefully designed degradation strategies, extra training, and additional sampling steps. In this paper, we propose a simple yet effective strategy, Internal Guidance (IG), which adds an auxiliary supervision signal to an intermediate layer during training and, during sampling, extrapolates between the intermediate and final layers' outputs to obtain the generative result. This simple strategy yields significant improvements in both training efficiency and generation quality on DiTs and SiTs. On ImageNet 256×256, SiT-XL/2+IG achieves FID=5.31 and FID=1.88, already surpassing vanilla SiT-XL and REPA. More impressively, LightningDiT-XL/1+IG achieves FID=1.41, outperforming all of these methods by a large margin. Combined with classifier-free guidance, LightningDiT-XL/1+IG achieves a state-of-the-art FID of 1.23.
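As a rough illustration of the sampling-time extrapolation described above, the guided prediction can be formed CFG-style by treating the intermediate-layer output as the weak branch and the final-layer output as the strong branch. The sketch below is an assumption about the form of this combination, not the authors' exact formulation; the function name, tensor arguments, and guidance scale `w` are hypothetical.

```python
import torch

def internal_guidance(pred_deep: torch.Tensor,
                      pred_mid: torch.Tensor,
                      w: float = 1.5) -> torch.Tensor:
    """Extrapolate from the intermediate-layer prediction toward the
    final-layer prediction (a minimal sketch, analogous to CFG but using
    the model's own shallower output as the "weak" branch).

    pred_deep: noise/velocity prediction decoded from the final layer
    pred_mid:  prediction decoded from the auxiliary intermediate head
               (hypothetical; assumes such a head exists per the abstract)
    w:         guidance scale; w = 1 recovers the final-layer output
    """
    return pred_mid + w * (pred_deep - pred_mid)
```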