Poster
VideoAlchemy: Open-set Personalization in Video Generation
Tsai-Shien Chen · Aliaksandr Siarohin · Willi Menapace · Yuwei Fang · Kwot Sin Lee · Ivan Skorokhodov · Kfir Aberman · Jun-Yan Zhu · Ming-Hsuan Yang · Sergey Tulyakov
Abstract:
Video personalization methods allow us to synthesize videos with specific concepts such as people, pets, and places. However, existing methods often focus on limited domains, require time-consuming optimization per subject, or support only a single subject. We present a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and background, eliminating the need for time-consuming test-time optimization. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt with cross-attention layers. Developing such a large model presents two main challenges: dataset construction and evaluation. First, as paired datasets of reference images and videos are extremely hard to collect, we opt to sample selected video frames as reference images and synthesize a clip of the target video. This approach, however, introduces a data bias issue: models easily denoise training videos but fail to generalize to new contexts during inference. To mitigate this issue, we carefully design a new automatic data construction pipeline with extensive image augmentations. Second, evaluating open-set video personalization is a challenge in itself. To address this, we introduce a new personalization benchmark whose evaluation protocols focus on accurate subject fidelity assessment and accommodate different types of personalization conditioning. Finally, our extensive experiments show that VideoAlchemy significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.
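The abstract describes fusing each reference image with its subject-level text prompt through cross-attention inside a Diffusion Transformer. Below is a minimal PyTorch sketch of one way such a conditioning block could be wired; the module names, dimensions, and token layout are assumptions for illustration only, not the authors' implementation.

```python
# Sketch (assumptions, not the paper's code): per-subject conditioning tokens are built
# by concatenating reference-image tokens with the subject-level text tokens, then the
# video latent tokens cross-attend to them inside a Diffusion Transformer block.
import torch
import torch.nn as nn


class SubjectCrossAttentionBlock(nn.Module):
    """Cross-attends video latent tokens to per-subject (image + text) conditioning tokens."""

    def __init__(self, dim: int = 1024, num_heads: int = 16):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(
        self,
        video_tokens: torch.Tensor,    # (B, N_video, dim) noisy video latent tokens
        subject_tokens: torch.Tensor,  # (B, N_cond, dim) fused conditioning tokens
    ) -> torch.Tensor:
        # Queries come from the video latents; keys/values from the subject conditioning.
        q = self.norm(video_tokens)
        attn_out, _ = self.cross_attn(q, subject_tokens, subject_tokens)
        return video_tokens + attn_out  # residual connection


def build_subject_tokens(image_tokens: torch.Tensor, word_tokens: torch.Tensor) -> torch.Tensor:
    """Concatenate each reference image's tokens with its subject-level text tokens.

    image_tokens: (B, N_subjects, N_img, dim), e.g. from an image encoder
    word_tokens:  (B, N_subjects, N_txt, dim), embeddings of the subject word(s)
    """
    b, s, _, d = image_tokens.shape
    fused = torch.cat([image_tokens, word_tokens], dim=2)  # (B, S, N_img + N_txt, dim)
    return fused.reshape(b, s * fused.shape[2], d)          # flatten subjects into one sequence
```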
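The data bias mitigation rests on heavily augmenting the reference images that are sampled from the training video itself, so the model cannot simply copy pixels back into the generated clip. A hypothetical augmentation stack of this kind, using torchvision, is sketched below; the specific transforms and parameters are assumptions, not the paper's pipeline.

```python
# Sketch (assumptions only): augment reference frames sampled from the training video
# to break the exact pixel correspondence with the target clip.
import torchvision.transforms as T

reference_augmentation = T.Compose([
    T.RandomResizedCrop(512, scale=(0.6, 1.0)),    # vary framing and scale
    T.RandomHorizontalFlip(p=0.5),                 # break exact pose correspondence
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.05),
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
])

# augmented_reference = reference_augmentation(reference_image)  # PIL image or tensor
```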