Skip to yearly menu bar Skip to main content


M&M VTO: Multi-Garment Virtual Try-On and Editing

Luyang Zhu · Yingwei Li · Nan Liu · Hao Peng · Dawei Yang · Ira Kemelmacher-Shlizerman

Arch 4A-E Poster #115
award Highlight
[ ] [ Project Page ]
Wed 19 Jun 10:30 a.m. PDT — noon PDT

Abstract: We present M&M VTO–a mix and match virtual try-on method that takes as input multiple garment images, text description for garment layout and an image of a person. An example input includes: an image of a shirt, an image of a pair of pants, "rolled sleeves, shirt tucked in", and an image of a person. The output is a visualization of how those garments (in the desired layout) would look like on the given person. Key contributions of our method are: 1) a single stage diffusion based model, with no super resolution cascading, that allows to mix and match multiple garments at $1024\mathord\times\mathord512$ resolution preserving and warping intricate garment details, 2) architecture design (VTO UNet Diffusion Transformer) to disentangle denoising from person specific features, allowing for a highly effective finetuning strategy for identity preservation (6MB model per individual vs 4GB achieved with, e.g., dreambooth finetuning); solving a common identity loss problem in current virtual try-on methods, 3) layout control for multiple garments via text inputs specifically finetuned over PaLI-3 for virtual try-on task. Experimental results indicate that M&M VTO achieves state-of-the-art performance both qualitatively and quantitatively, as well as opens up new opportunities for virtual try-on via language-guided and multi-garment try-on.

Live content is unavailable. Log in and register to view live content