MagicQuill V2: Precise and Interactive Image Editing with Layered Visual Cues
Abstract
We propose MagicQuill V2, a novel framework that introduces a layered composition paradigm to generative image editing, bridging the gap between the semantic power of modern diffusion models and the granular control of traditional graphics software. While state-of-the-art diffusion transformers excel at holistic generation, their reliance on a single, monolithic prompt fails to disentangle distinct user intentions for content, position, and style. To overcome this limitation, our method deconstructs creative intent into a stack of independently controllable visual cues: a content layer for what to create, a spatial layer for where to place it, a structural layer for how it is shaped, and a color layer for its palette. Our technical contributions include a specialized data generation pipeline for context-aware content integration, a unified control module to process all visual cues, and a fine-tuned spatial branch for precise local editing, including object removal. Extensive experiments validate that this layered approach effectively resolves the user intention gap, granting creators direct, intuitive, and powerful control over the generative process.
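
To make the layered decomposition concrete, the minimal Python sketch below models the cue stack as plain data. It is an illustration only: every class and field name (ContentLayer, SpatialLayer, CueStack, active_cues, and so on) is a hypothetical stand-in, not the paper's actual interface, and the real system would carry tensors rather than nested lists.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical containers for each visual cue; names and fields are
# illustrative assumptions, not MagicQuill V2's published API.
@dataclass
class ContentLayer:
    prompt: str                   # what to create, e.g. "a red vintage car"

@dataclass
class SpatialLayer:
    mask: List[List[int]]         # binary mask marking where the edit applies

@dataclass
class StructuralLayer:
    edge_map: List[List[float]]   # sketch/edge cue constraining shape

@dataclass
class ColorLayer:
    palette: List[str]            # hex colors constraining the palette

@dataclass
class CueStack:
    """A stack of independently controllable cues mirroring the
    content / spatial / structural / color decomposition above."""
    content: ContentLayer
    spatial: Optional[SpatialLayer] = None
    structural: Optional[StructuralLayer] = None
    color: Optional[ColorLayer] = None

    def active_cues(self) -> List[str]:
        # Only the layers a user actually supplied would be routed
        # to the unified control module.
        names = ["content"]
        for name in ("spatial", "structural", "color"):
            if getattr(self, name) is not None:
                names.append(name)
        return names

# Usage: a prompt plus a region mask, leaving shape and palette free.
stack = CueStack(
    content=ContentLayer(prompt="a red vintage car"),
    spatial=SpatialLayer(mask=[[0, 1], [1, 1]]),
)
print(stack.active_cues())  # ['content', 'spatial']
```

The point of the sketch is the separation itself: because each cue lives in its own layer, any subset can be supplied or withheld independently, which is what lets the framework disentangle the intentions that a single monolithic prompt conflates.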