

Poster

D2iT: Dynamic Diffusion Transformer for Accurate Image Generation

Weinan Jia · Mengqi Huang · Nan Chen · Lei Zhang · Zhendong Mao


Abstract: Diffusion models are widely recognized for their ability to generate high-fidelity images. Despite the excellent performance and scalability of the Diffusion Transformer (DiT) architecture, it applies fixed compression across different image regions during the diffusion process, disregarding the naturally varying information densities present in these regions. Large compression leads to limited local realism, while small compression increases computational complexity and compromises global consistency, ultimately impacting the quality of generated images. To address these limitations, we propose dynamically compressing different image regions according to their importance, and introduce a novel two-stage framework designed to enhance the effectiveness and efficiency of image generation: (1) Dynamic VAE (DVAE) in the first stage employs a hierarchical encoder to encode different image regions at different downsampling rates, tailored to their specific information densities, thereby providing more accurate and natural latent codes for the diffusion process. (2) Dynamic Diffusion Transformer (D2iT) in the second stage generates images by predicting multi-grained noise, consisting of coarse-grained noise (fewer latent codes in smooth regions) and fine-grained noise (more latent codes in detailed regions), through an innovative combination of the Dynamic Grain Transformer and the Dynamic Content Transformer. The strategy of combining rough noise prediction with fine-grained region correction unifies global consistency and local realism. We conduct comprehensive experiments on the ImageNet 256×256 benchmark, showing that D2iT achieves a 23.8% quality improvement over DiT (FID of 1.73 for D2iT vs. 2.27 for DiT; lower is better) while using only 57.1% of the computational resources of DiT.
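The 23.8% figure appears to be the relative reduction in FID with respect to the DiT baseline. The short Python check below reproduces it from the two FID scores quoted in the abstract. The function name `relative_improvement` is an illustrative choice and does not come from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Fractional reduction of a lower-is-better metric relative to the baseline."""
    return (baseline - new) / baseline


# FID scores quoted in the abstract (ImageNet 256x256, lower is better).
dit_fid = 2.27
d2it_fid = 1.73

improvement = relative_improvement(dit_fid, d2it_fid)
print(f"Relative FID improvement: {improvement:.1%}")  # prints 23.8%
```

The calculation is (2.27 − 1.73) / 2.27 ≈ 0.238, which matches the improvement stated in the abstract.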
