LacTokGen: Latent Consistency Tokenizer for 1024-pixel Image Generation by 256 Tokens
Qingsong Xie ⋅ Luyuan Zhang ⋅ Zhao Zhang ⋅ Siyuan Li ⋅ Zhe Huang ⋅ Zhenyu Yang ⋅ Haonan Lu
Abstract
Image tokenization has significantly advanced visual generation and multimodal modeling, particularly when paired with autoregressive models. However, current methods struggle to balance efficiency and quality: high-resolution image generation either requires an excessive number of tokens or sacrifices critical details through token reduction. To resolve this, we propose the Latent Consistency Tokenizer (LacTok), which bridges discrete visual tokens with the compact latent space of pretrained latent diffusion models (LDMs), enabling efficient representation of 1024×1024 images using only 256 tokens, a 16× compression over VQGAN. LacTok integrates a transformer encoder, a quantized codebook, and a latent consistency decoder. Directly applying an LDM as the decoder results in color and brightness discrepancies; we therefore convert it into a latent consistency decoder, reducing multi-step sampling to 1–2 steps and enabling direct pixel-level supervision. To endow LacTok with text-to-image generation capabilities, we seamlessly integrate it with an autoregressive transformer, forming LacTokGen. This transformer is trained to predict compact token sequences conditioned on text instructions. Experiments demonstrate LacTok's superiority in high-fidelity reconstruction, achieving a reconstruction Fréchet Inception Distance of 10.8 on the MSCOCO-2017 5K benchmark for 1024×1024 image reconstruction. LacTokGen achieves a score of 0.73 on the GenEval benchmark and 0.304 HPSv2 on the MSCOCO-2017 dataset.
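To make the tokenization step concrete, the sketch below illustrates the vector-quantization stage of a LacTok-style pipeline: a transformer encoder (not shown) maps a 1024×1024 image to 256 continuous latent vectors, and each vector is snapped to its nearest codebook entry to produce 256 discrete token ids. The codebook size and latent dimension here are illustrative assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.standard_normal((1024, 16))   # assumed codebook: 1024 entries, dim 16
latents = rng.standard_normal((256, 16))     # 256 encoder latents per image (as in LacTok)

# Nearest-neighbour lookup by squared Euclidean distance.
dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = dists.argmin(axis=1)                # (256,) discrete token ids
quantized = codebook[tokens]                 # quantized vectors passed to the decoder

print(tokens.shape)  # (256,)
```

An autoregressive transformer would then model the distribution over these 256-token sequences conditioned on text, and the latent consistency decoder would map the quantized vectors back to pixels in 1–2 sampling steps.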