Your Latent Mask is Wrong: Pixel-Equivalent Latent Compositing for Diffusion Models
Abstract
Linearly interpolating between VAE latents with a downsampled mask remains a common heuristic for diffusion inpainting. This approach, however, systematically violates a key principle, pixel equivalence: compositing latents must approximate compositing pixels. Because VAE latents encode global context rather than pixel-local structure, linear interpolation fails this requirement, producing seams, color shifts, and halos that diffusion then amplifies into larger artifacts.

We propose Pixel-Equivalent Latent Compositing (PELC) and instantiate it with DecFormer, a 7.7M-parameter transformer that predicts per-channel blend weights and a nonlinear residual to realize mask-consistent latent fusion. DecFormer is trained so that decoding the fused latent matches pixel-space alpha compositing; it is plug-compatible with existing diffusion pipelines, requires no backbone finetuning, and adds only 0.07\% of FLUX.1-Dev’s parameters and 3.5\% FLOP overhead.

On the FLUX.1 family, DecFormer restores global color consistency, supports soft masks, and produces sharp boundaries and high-fidelity masking, reducing error metrics around mask edges by up to 53\% over standard mask interpolation. Used as an inpainting prior, a lightweight LoRA on FLUX.1-Dev combined with DecFormer achieves fidelity comparable to FLUX.1-Fill, a fully finetuned inpainting model. While we focus on inpainting, PELC is a general recipe for pixel-equivalent latent editing (e.g., overlays, tone and relighting adjustments, warps), as we demonstrate on a complex color-correction task.
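The pixel-equivalence requirement described above can be sketched formally. The notation here is illustrative rather than taken from the paper: $\mathcal{D}$ denotes the VAE decoder, $z_{\mathrm{fg}}, z_{\mathrm{bg}}$ the foreground and background latents, $M$ the pixel-space mask, and $m$ its latent-resolution counterpart.

```latex
% A hedged sketch of the PELC objective (symbols are ours, not the paper's).
% DecFormer-style fusion: per-channel blend weights w_theta plus a nonlinear residual r_theta,
%   f_theta(z_fg, z_bg, m) = w_theta .* z_fg + (1 - w_theta) .* z_bg + r_theta
\begin{equation}
  f_\theta(z_{\mathrm{fg}}, z_{\mathrm{bg}}, m)
    = w_\theta \odot z_{\mathrm{fg}} + (1 - w_\theta) \odot z_{\mathrm{bg}} + r_\theta
\end{equation}
% Training drives decoded fusion toward pixel-space alpha compositing:
\begin{equation}
  \min_\theta \;
  \Bigl\| \, \mathcal{D}\bigl(f_\theta(z_{\mathrm{fg}}, z_{\mathrm{bg}}, m)\bigr)
    - \bigl( M \odot \mathcal{D}(z_{\mathrm{fg}}) + (1 - M) \odot \mathcal{D}(z_{\mathrm{bg}}) \bigr) \Bigr\|
\end{equation}
```

Naive mask interpolation corresponds to fixing $w_\theta = m$ and $r_\theta = 0$, which the abstract argues cannot satisfy the objective because the decoder is not pixel-local.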