Semantic Context Matters: Improving Conditioning for Autoregressive Models
Abstract
Recently, autoregressive (AR) models have shown strong potential in image generation, offering better scalability and easier integration with unified multi-modal models than diffusion methods. However, extending AR models to controllable image editing remains challenging due to weak and inefficient conditioning strategies, which often lead to suboptimal semantic alignment and visual quality. To address this limitation, we present SCAR, a Semantic-Context-driven method for AutoRegressive models. SCAR introduces Compressed Semantic Prefilling and Semantic Alignment Guidance, which jointly enhance contextual understanding and generation coherence. Unlike prior methods that rely on sparse visual tokens or decoding-stage injection, SCAR provides strong semantic guidance from the input stage while remaining model-agnostic and applicable to both next-token and next-scale AR paradigms. Extensive experiments on instruction-based editing and controllable generation demonstrate that our method significantly improves visual fidelity and semantic alignment, outperforming existing AR-based methods while maintaining controllability. All code will be released.