PG-VTON: Single-Pass Training-Free Virtual Try-On via Patch-Guided Reference Alignment
Abstract
Virtual try-on (VTON) aims to render a target garment onto a person while preserving pose, identity, and fine-grained appearance. Most existing methods rely on supervised paired data, which limits cross-domain generalization; recent training-free approaches are more robust but require multiple diffusion calls and complex compositing, making them impractical to deploy. We propose PG-VTON, a single-pass, training-free framework based on Patch-Guided Reference Alignment. Our key insight is that modern inpainting diffusion models already possess strong in-context completion ability: given a masked person and a small garment patch, they can synthesize plausible, pose-consistent clothing without task-specific training. PG-VTON exploits this capability with two lightweight components, neither of which modifies model weights: Patch-Anchored Identity Priming (PIP) injects a localized garment patch during only the early denoising steps to anchor garment identity, and Reference-Aware Attention (RAA) strengthens attention from masked-region tokens to garment tokens to enhance detail transfer. With a single diffusion pass, PG-VTON achieves state-of-the-art performance among training-free methods on DressCode and VITON-HD and generalizes effectively to subject insertion.
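To make the two components concrete, the sketch below illustrates one plausible realization in PyTorch. The abstract gives no implementation details, so everything here is an assumption rather than the authors' code: the names `pip_inject`, `raa_attention`, `anchor_steps`, and `boost` are hypothetical, PIP is modeled as pasting garment-patch latents into the sample during the first few denoising steps, and RAA is modeled as a log-space bias that upweights attention from masked-region query tokens to garment key tokens.

```python
# Minimal, illustrative sketch (hypothetical names and signatures; the
# actual method hooks into an inpainting diffusion sampler and its
# attention layers).
import torch
import torch.nn.functional as F

def pip_inject(latent, patch_latent, patch_mask, step, anchor_steps=10):
    """Patch-Anchored Identity Priming (PIP): blend the encoded garment
    patch into the sample, but only during the first `anchor_steps`
    denoising steps, so the patch anchors garment identity while later
    steps remain free to harmonize pose and shading."""
    if step < anchor_steps:
        latent = torch.where(patch_mask.bool(), patch_latent, latent)
    return latent

def raa_attention(q, k, v, masked_idx, garment_idx, boost=1.5):
    """Reference-Aware Attention (RAA): multiply the unnormalized
    attention weights from masked-region query tokens to garment key
    tokens by `boost`, implemented as an additive log-space bias."""
    scale = q.shape[-1] ** -0.5
    logits = q @ k.transpose(-2, -1) * scale          # (B, Tq, Tk)
    bias = torch.zeros_like(logits)
    bias[:, masked_idx[:, None], garment_idx[None, :]] = torch.log(
        torch.tensor(boost))
    return F.softmax(logits + bias, dim=-1) @ v

# Toy usage: 4 masked-region queries over 8 context tokens, of which
# tokens 5-7 come from the garment reference.
q, k, v = (torch.randn(1, 4, 64), torch.randn(1, 8, 64),
           torch.randn(1, 8, 64))
out = raa_attention(q, k, v,
                    masked_idx=torch.arange(4),
                    garment_idx=torch.tensor([5, 6, 7]))
```

Both operations act only on intermediate latents and attention logits, which is consistent with the abstract's claim that no model weights are modified.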