

Poster

Fast Convergence of Diffusion Transformers in a Better High-Dimensional Latent Space

Jingfeng Yao · Bin Yang · Xinggang Wang


Abstract:

Latent diffusion models (LDMs) with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an optimization dilemma in this two-stage design: increasing the per-token feature dimension in visual tokenizers improves reconstruction quality but requires substantially larger diffusion models and longer training to maintain generation performance. This results in prohibitively high computational costs, making high-dimensional tokenizers impractical. In this paper, we argue that this dilemma stems from the inherent difficulty of learning unconstrained high-dimensional latent spaces, and we address it by aligning the latent space with pre-trained vision foundation models. Our VA-VAE (Vision foundation model Aligned Variational AutoEncoder) expands the Pareto frontier of visual tokenizers, enabling 2.7× faster convergence of Diffusion Transformers (DiT) in high-dimensional latent spaces. To further validate our approach, we optimize a DiT baseline, referred to as LightningDiT, which achieves superior performance on class-conditional generation with only 6% of the original training epochs. The integrated system demonstrates the effectiveness of VA-VAE, achieving 0.28 rFID and 1.73 gFID on ImageNet-256 generation in 400 epochs, outperforming the original DiT's 0.71 rFID and 2.27 gFID at 1400 epochs without resorting to more complex designs. To our knowledge, this is the first latent diffusion system to achieve both superior generation and reconstruction without increasing training costs. Our code and weights will be open-sourced.
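To make the alignment idea concrete, below is a minimal PyTorch sketch of one way such a constraint could be implemented: projecting per-token tokenizer latents into the feature space of a frozen vision foundation model and penalizing their cosine distance during VAE training. The class name `LatentAlignmentLoss`, the linear projection head, the cosine objective, the example foundation model (e.g., DINOv2), and the loss weighting are illustrative assumptions for exposition only; the exact alignment loss used by VA-VAE is not specified in this abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentAlignmentLoss(nn.Module):
    """Hypothetical alignment term: pulls per-token VAE latents toward
    features from a frozen vision foundation model (e.g., DINOv2).
    The projection head and cosine objective are illustrative choices,
    not necessarily the paper's exact formulation."""

    def __init__(self, latent_dim: int, foundation_dim: int):
        super().__init__()
        # Learnable projection from the tokenizer's latent channels
        # to the foundation model's feature dimension.
        self.proj = nn.Linear(latent_dim, foundation_dim)

    def forward(self, latents: torch.Tensor, foundation_feats: torch.Tensor) -> torch.Tensor:
        # latents:          (B, N, latent_dim)      per-token tokenizer latents
        # foundation_feats: (B, N, foundation_dim)  frozen features, treated as targets
        pred = F.normalize(self.proj(latents), dim=-1)
        target = F.normalize(foundation_feats.detach(), dim=-1)
        # 1 - cosine similarity, averaged over tokens and the batch
        return (1.0 - (pred * target).sum(dim=-1)).mean()


# Example usage with dummy shapes (B=2, N=256 tokens), assuming a 32-channel
# latent and 768-dimensional foundation features:
align = LatentAlignmentLoss(latent_dim=32, foundation_dim=768)
align_loss = align(torch.randn(2, 256, 32), torch.randn(2, 256, 768))

# One plausible way to fold the term into a standard VAE objective:
# total_loss = recon_loss + kl_weight * kl_loss + align_weight * align_loss
```

The intent of such a term is to regularize an otherwise unconstrained high-dimensional latent space with semantically structured targets, which the abstract identifies as the source of the reconstruction-versus-generation dilemma.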
