FastHybrid: Accelerating Hybrid Autoregressive Image Generation with Lookahead and Guided Decoding
Abstract
Autoregressive (AR) models have achieved remarkable success in natural language processing, yet their application to image generation faces significant challenges. With VQ-based decoders, autoregressively generated images typically preserve semantic information but struggle with fine-grained details. Recent hybrid AR image generation approaches address this issue by integrating diffusion models as decoder heads, enabling higher-fidelity generation. However, the diffusion-based denoising process introduces significant computational overhead during inference. To accelerate hybrid AR image generation, we propose the Lookahead Decoding Strategy, which integrates the strengths of autoregressive and diffusion models by separating the process into two complementary branches: semantic prediction and detail refinement. The autoregressive branch captures high-level semantic structures while refining coarse predictions made by the parallel branch. Furthermore, we introduce Guided Diffusion Sampling to steer the diffusion denoising trajectory, significantly reducing the number of denoising steps. Extensive experiments demonstrate that our approach provides an effective solution for accelerating hybrid AR image generation models.