Progressive Supernet Training for Efficient Visual Autoregressive Modeling
Abstract
Visual Autoregressive (VAR) models have demonstrated performance competitive with diffusion models in image generation by adopting a "next-scale" prediction paradigm that significantly reduces the number of inference steps. However, VAR's progressive multi-scale generation incurs severe memory overhead from KV cache accumulation across all scales, limiting practical deployment. Existing solutions either require training and deploying multiple specialized models or sacrifice generation quality. We observe a critical scale-depth asymmetric dependency in VAR: small scales (low-resolution tokens) are highly sensitive to network depth and require deep layers to capture global semantic information, while large scales (high-resolution tokens) are remarkably robust to depth reduction. Motivated by this insight, we propose VARiant, a unified supernet framework that enables dynamic depth adjustment within a single model through parameter sharing. Our approach employs an even-spacing layer-selection strategy to construct quality-preserving subnetworks, and introduces a dynamic-ratio progressive training strategy that gradually shifts the subnetwork-to-full-network optimization ratio from 2:8 (joint optimization) to 10:0 (subnetwork-only optimization), effectively resolving the inherent optimization conflict between the full network and its subnetworks in supernet training. Extensive experiments on ImageNet demonstrate that our method achieves Pareto-optimal trade-offs between generation quality and inference efficiency: using full depth (30 layers) for the first 7 scales and a 16-layer subnetwork (50% depth) for the remaining scales yields a 50% cache reduction and a 1.8× inference speedup with minimal quality loss (FID increases by only 0.3). Unlike approaches that require deploying multiple models, our single-model solution eliminates deployment complexity, supports zero-cost runtime depth switching, and integrates seamlessly into standard transformer inference frameworks, making it highly practical for resource-constrained scenarios.
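To make the two mechanisms described above concrete, the following is a minimal sketch (not the authors' code) of even-spacing layer selection and per-scale depth switching. The function names, the scale index at which the switch happens, and the use of plain index lists are assumptions for illustration; the 30-layer full network, 16-layer subnetwork, and switch after the first 7 scales follow the numbers quoted in the abstract.

```python
def even_spacing_layers(total_layers: int, sub_layers: int) -> list[int]:
    """Pick `sub_layers` layer indices evenly spaced across a
    `total_layers`-deep network (endpoints included)."""
    if sub_layers >= total_layers:
        return list(range(total_layers))
    step = (total_layers - 1) / (sub_layers - 1)
    return [round(i * step) for i in range(sub_layers)]


def layers_for_scale(scale_idx: int,
                     total_layers: int = 30,
                     sub_layers: int = 16,
                     switch_at: int = 7) -> list[int]:
    """Run the full network for the first `switch_at` scales,
    then switch to the evenly spaced subnetwork (zero-cost at runtime:
    it is just a different set of shared layers)."""
    if scale_idx < switch_at:
        return list(range(total_layers))
    return even_spacing_layers(total_layers, sub_layers)


print(len(layers_for_scale(0)))   # full depth for an early (small) scale
print(len(layers_for_scale(8)))   # reduced depth for a later (large) scale
```

Because both configurations index into the same shared parameter set, switching depth between scales requires no model reloading, which is what enables the single-model deployment the abstract describes.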