Rethinking the Semantic-based Autoencoder
Abstract
Latent generative modeling has emerged as the dominant paradigm for Diffusion Transformers (DiT), where a pretrained autoencoder compresses image pixels into a latent space to facilitate the diffusion process. Recently, the use of semantic encoders within autoencoders (AEs) has gained attention, yet their influence on image reconstruction and diffusion model training remains insufficiently explored. In this study, we conduct an in-depth examination of how semantic encoders shape latent representation learning in autoencoders. Our findings reveal a fundamental trade-off: while semantic encoders produce latent spaces rich in visual semantics, their high level of abstraction makes it difficult to capture fine-grained geometric relationships, requiring larger models and longer training to converge. To address this issue, we build upon recent advances in representation learning that enable the joint modeling of semantic abstraction and geometric detail. The result is a Semantic Auto-Encoder (S-AE) that achieves state-of-the-art performance, combining superior reconstruction quality with strong discriminative capability. Specifically, S-AE provides a unified latent space that achieves an FID of 0.06 for image reconstruction and 81.9\% classification accuracy on ImageNet, setting a new state-of-the-art benchmark. Code and model weights will be made publicly available.