

Poster

V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents

Zhengrong Yue · Shaobin Zhuang · Kunchang Li · Yanbo Ding · Yali Wang


Abstract: Despite recent advances in video stylization, most existing methods struggle to render videos with complex transitions based on an open style description in a user query. To fill this gap, we introduce V-Stylist, a generic multi-agent system for video stylization built on a novel collaboration and reflection paradigm of multi-modal large language models (MLLMs). Specifically, V-Stylist is a systematic workflow with three key roles: (1) Video Parser decomposes the input video into a number of shots and generates text prompts describing the key content of each shot. Via a concise video-to-shot prompting paradigm, it allows V-Stylist to effectively handle videos with complex transitions. (2) Style Parser identifies the style in the user query and progressively searches for the matching style model in a style tree. Via a robust tree-of-thought searching paradigm, it allows V-Stylist to precisely specify vague style preferences in an open user query. (3) Style Artist leverages the matched model to render all the video shots into the required style. Via a novel multi-round self-reflection paradigm, it allows V-Stylist to adaptively adjust detail control according to the style requirement. With this distinct design that mimics human professionals, V-Stylist achieves a major breakthrough on the primary challenges of effective and automatic video stylization. Moreover, we construct a new benchmark, the Text-driven Video Stylization Benchmark (TVSBench), which fills the gap in assessing stylization of complex videos on open user queries. Extensive experiments show that V-Stylist achieves state-of-the-art performance, e.g., V-Stylist surpasses FRESCO and ControlVideo by 6.05% and 4.51% respectively in overall average metrics, marking a significant advance in video stylization.
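The abstract gives no implementation details for the Style Parser, so the following is only a minimal sketch of what a top-down search over a style tree could look like. Every name here (`StyleNode`, `score`, `search_style_tree`, `build_demo_tree`, the model identifiers, and the tree contents) is hypothetical, and a simple keyword-overlap score stands in for the MLLM judgment that the system presumably uses at each level of the tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class StyleNode:
    """One node of a hypothetical style tree; leaves carry a style model id."""
    name: str
    keywords: Set[str] = field(default_factory=set)
    model: Optional[str] = None
    children: List["StyleNode"] = field(default_factory=list)


def score(query: str, node: StyleNode) -> int:
    """Stand-in for an MLLM relevance judgment: count shared keywords."""
    words = set(query.lower().split())
    return len(words & node.keywords)


def search_style_tree(query: str, root: StyleNode) -> Tuple[Optional[str], List[str]]:
    """Descend from the root, picking the best-scoring child at each level."""
    node = root
    path = [root.name]
    while node.children:
        # Ties resolve to the first child in list order.
        node = max(node.children, key=lambda child: score(query, child))
        path.append(node.name)
    return node.model, path


def build_demo_tree() -> StyleNode:
    """Toy tree for illustration only; not the tree used by V-Stylist."""
    oil = StyleNode("oil", {"oil"}, model="oil_model")
    watercolor = StyleNode("watercolor", {"watercolor"}, model="watercolor_model")
    painting = StyleNode("painting", {"painting", "oil", "watercolor"},
                         children=[oil, watercolor])
    anime = StyleNode("anime", {"anime", "cartoon"}, model="anime_model")
    return StyleNode("root", children=[painting, anime])


if __name__ == "__main__":
    model, path = search_style_tree("soft watercolor painting of a city",
                                    build_demo_tree())
    print(model, path)
```

Narrowing the choice level by level keeps each decision small: a vague query only has to be matched against a handful of sibling categories at a time rather than against every available style model at once.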
