LazyVAR: Accelerating Visual Autoregressive Models via Scale-wise Token Pruning and Parallel Group Decoding
Abstract
Visual Autoregressive (VAR) modeling introduces a new paradigm for image generation by extending autoregressive mechanisms from next-token prediction to next-scale prediction, achieving remarkable performance. However, because the number of tokens grows rapidly with scale, processing full token maps at high resolution becomes computationally expensive. In addition, the inherently sequential nature of autoregressive modeling prevents parallel inference across scales, which further increases latency. To address these challenges, we propose LazyVAR, a training-free and plug-and-play acceleration method for VAR models. Our key observation is that the similarity of aggregated latent features between adjacent scales increases progressively with the scale index, reaching especially high values at the largest scales. We treat this similarity as a Scale-Wise Update Index, which serves as the pruning criterion; consequently, more tokens can be pruned at larger scales to improve efficiency. Furthermore, we propose Parallel Group Decoding, which exploits this high similarity at larger scales to decode tokens from different scales in parallel, further accelerating inference. Experimental results show that LazyVAR achieves up to a 2.94× speedup over FlashAttention-accelerated VAR models with negligible performance loss, allowing the Infinity-2B text-to-image model to generate 1024×1024 images within 0.5 seconds on a single RTX 4090 GPU. Our code will be publicly available.
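To make the pruning criterion in the abstract concrete, the sketch below illustrates one plausible reading of it in PyTorch: compute a per-token cosine similarity between the previous scale's latent features (upsampled to the current resolution) and the current scale's features, then skip recomputation for the tokens that changed least. The function names (`scale_wise_update_index`, `prune_mask`), the bicubic upsampling, and the keep-ratio thresholding are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F


def scale_wise_update_index(prev_feat: torch.Tensor, curr_feat: torch.Tensor) -> torch.Tensor:
    """Per-token cosine similarity between adjacent scales.

    prev_feat: (C, H_prev, W_prev) latent features of the previous scale.
    curr_feat: (C, H_curr, W_curr) latent features of the current scale.
    Returns a (H_curr * W_curr,) similarity score, one value per token.
    """
    # Upsample the previous scale's feature map to the current token-map resolution.
    prev_up = F.interpolate(
        prev_feat.unsqueeze(0), size=curr_feat.shape[-2:],
        mode="bicubic", align_corners=False,
    ).squeeze(0)
    # Cosine similarity along the channel dimension, flattened over spatial tokens.
    return F.cosine_similarity(prev_up.flatten(1), curr_feat.flatten(1), dim=0)


def prune_mask(sim: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep only the tokens whose features changed most (lowest similarity).

    Tokens outside the mask are "pruned": their values can be reused from the
    previous scale instead of being recomputed. A smaller keep_ratio at larger
    scales prunes more tokens, following the trend described in the abstract.
    """
    num_keep = max(1, int(keep_ratio * sim.numel()))
    keep_idx = torch.topk(-sim, num_keep).indices  # least-similar tokens
    mask = torch.zeros_like(sim, dtype=torch.bool)
    mask[keep_idx] = True
    return mask
```

In this reading, the keep ratio would shrink as the scale index grows, since adjacent-scale similarity rises with resolution; the exact schedule, feature aggregation, and the grouping used for Parallel Group Decoding are detailed in the paper itself.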