High-Fidelity Virtual Try-On beyond Paired Data Scarcity via Diffusion-based Cycle-Consistent Learning
Abstract
Diffusion-based virtual try-on methods rely on large amounts of high-quality garment-person pairs, which are scarce in practice due to the high cost of data collection and preprocessing, limiting their performance in real-world scenarios. To overcome this bottleneck, we propose Cycle-Consistent Virtual Try-On (CCVTON), a diffusion-based approach that enables effective training on massive collections of in-the-wild person images. Specifically, CCVTON introduces a Cycle-Consistent Learning (CCL) strategy that employs a single unified generative model to disentangle a garment from a person image (try-off branch) and transfer it back to the same individual (try-on branch), forming a reconstruction cycle. To this end, we first warm up a Unified Diffusion Transformer (UDiT) on open-source paired data to acquire basic try-on and try-off capabilities. When adapting UDiT to in-the-wild person images, we employ a Multi-Criteria Filtering Operation to select high-quality garments disentangled from person images by the pretrained UDiT. These filtered garments are not used as inputs for CCL; instead, they serve as soft constraints in a perceptual regularization loss, preventing the try-off branch from collapsing to trivial copying. In addition, we propose a garment-aware mask generation scheme with a two-stage refinement process to suppress garment leakage while maintaining person consistency. Extensive experiments show that CCVTON achieves state-of-the-art results.
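To make the reconstruction cycle concrete, the following is a minimal sketch of the objective implied by the abstract; all symbols are our own notation, not taken from the paper, and the exact loss terms and weighting are assumptions.

% Sketch of the cycle-consistent objective (notation ours, hedged).
% x: in-the-wild person image; f_theta: the single unified model (UDiT),
% run in try-off and try-on modes; m: garment-aware mask;
% \tilde{g}: a filtered garment produced by the pretrained UDiT,
% used only as a soft perceptual constraint, not as a CCL input.
\[
\hat{g} = f_{\theta}^{\mathrm{off}}(x), \qquad
\hat{x} = f_{\theta}^{\mathrm{on}}\bigl(m \odot x,\, \hat{g}\bigr),
\]
\[
\mathcal{L} =
\underbrace{\lVert \hat{x} - x \rVert_{1}}_{\text{cycle reconstruction}}
\;+\;
\lambda\,
\underbrace{\mathcal{L}_{\mathrm{perc}}\bigl(\hat{g},\, \tilde{g}\bigr)}_{\text{soft garment regularization}},
\]
where $\mathcal{L}_{\mathrm{perc}}$ is a perceptual distance; the soft constraint is what keeps the try-off branch from collapsing to trivially copying the person image.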