Masked Region Transformer for Layered Image Generation and Editing at Scale
Abstract
Layered image generation and editing is a fundamental capability that enables layer-wise reuse, editing, and composition of generated visual content, analogous to word-level editing in natural language. Despite its importance, layered generation remains underexplored at scale. To address this gap, we present the Masked Region Transformer, a 20B-parameter diffusion model tailored for multi-layer transparent image generation and editing, trained on over 10M multilingual design samples spanning diverse aspect ratios and textual prompts. To fully leverage this scale, we make three key technical contributions. First, we unify three complementary tasks---text-to-layers, image-to-layers, and layers-to-layers---within a shared masked region diffusion framework, where selective token masking enables flexible cross-modal generation and fine-grained layer-wise editing. Second, we design an efficient conditional diffusion decoder that incorporates Gated DeltaNet and gated attention mechanisms, enhancing visual fidelity while maintaining computational efficiency. Third, we introduce an overflow-aware canvas layer that handles boundary inconsistencies and supports semi-transparent background synthesis, enabling the generation of complete, editable layers that extend beyond the visible canvas. Additionally, we apply distribution matching distillation to achieve one-step, real-time multi-layer generation with minimal quality degradation. Extensive experiments demonstrate that our framework substantially outperforms prior state-of-the-art approaches across all three tasks, establishing a new benchmark for region-aware transparent image generation.