DA-VAE: Plug-in Latent Compression for Diffusion via Detail Alignment
Xin Cai ⋅ Zhiyuan You ⋅ Zhoutong Zhang ⋅ Tianfan Xue
Abstract
Reducing the number of tokens in latent diffusion models is important for both efficient training and inference, especially at high resolutions. A common approach is to design high-compression image tokenizers that store more information per token by increasing the number of channels. However, packing more detail into each token tends to make the latent space less structured, which in turn makes diffusion training difficult. To address this, current solutions use semantic alignment or training-time dropout to impose structure on the latent space, which often requires retraining the diffusion model from scratch. Can we increase the compression ratio of the image tokenizer without expensive retraining? As we find, a simple solution is to explicitly add channels to the existing latent to capture image details, and to align them toward the latent of the pretrained diffusion model. Our method, \textbf{D}etail-\textbf{A}ligned VAE (DA-VAE), increases the compression ratio of a pretrained VAE while requiring only a lightweight adaptation stage for the corresponding pretrained diffusion backbone. Specifically, DA-VAE imposes an explicit latent structure: the first $C$ channels of the latent space are given by the pretrained VAE and encode the input image at half resolution, while an extra $D$ channels encode details of the image at full resolution. To make this new latent diffusion-friendly, we introduce a simple detail alignment strategy that constrains the extra $D$ channels to have a structure similar to that of the first $C$ channels. With this design, we provide a warm-start finetuning recipe that enables $1024\times1024$ image generation with Stable Diffusion 3.5 using only $32\times32$ tokens, $4\times$ fewer than the original model; the adaptation takes only 5 H100-days. We also show that DA-VAE unlocks $2048\times2048$ image generation with SD3.5, with a $6\times$ speedup and more stable image structure. We further validate the effectiveness of our method and design decisions quantitatively on ImageNet.
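To make the latent layout concrete, here is a minimal PyTorch sketch of the structure the abstract describes: the first $C$ channels come from the frozen pretrained VAE applied to a half-resolution input, and an extra $D$ channels encode full-resolution details, with an alignment term pulling the new channels toward the structure of the base ones. The names `pretrained_encoder`, `detail_encoder`, and `proj`, the specific dimensions, and the MSE form of the alignment loss are all illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

# Illustrative channel counts only; the paper's actual C and D may differ.
C, D = 16, 16

def build_da_vae_latent(x, pretrained_encoder, detail_encoder):
    """Assemble the DA-VAE latent as [base C channels | detail D channels].

    pretrained_encoder: frozen encoder of the existing VAE (hypothetical handle).
    detail_encoder: new branch encoding full-resolution details on the same
    spatial grid as the base latent (hypothetical handle).
    """
    # First C channels: pretrained VAE latent of the half-resolution image.
    x_half = F.interpolate(x, scale_factor=0.5, mode="bilinear", align_corners=False)
    z_base = pretrained_encoder(x_half)          # (B, C, h, w)

    # Extra D channels: full-resolution details, matching the base latent grid.
    z_detail = detail_encoder(x)                 # (B, D, h, w)
    return torch.cat([z_base, z_detail], dim=1)  # (B, C + D, h, w)

def detail_alignment_loss(z_latent, proj):
    """One plausible reading of 'detail alignment': a (hypothetical) 1x1
    projection maps the D detail channels onto the C base channels, and an
    MSE term encourages the detail channels to share the base latent's
    structure. The paper's actual objective may differ."""
    z_base = z_latent[:, :C].detach()            # keep the pretrained latent fixed
    z_detail = z_latent[:, C:]
    return F.mse_loss(proj(z_detail), z_base)
```

Here `proj` could be as simple as `torch.nn.Conv2d(D, C, kernel_size=1)`; the point of the sketch is only that the new channels are regularized toward the already diffusion-friendly base latent rather than learned freely.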