Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning
Abstract
We present Narrative Weaver, a novel framework that addresses a fundamental challenge in generative AI: achieving controllable, long-range, and consistent visual content generation. While existing models excel at generating high-fidelity short-form visual content, they struggle to maintain narrative coherence and visual consistency across extended sequences—a critical limitation for real-world applications such as filmmaking and e-commerce advertising. Narrative Weaver introduces the first holistic solution that seamlessly integrates three essential capabilities: fine-grained control, automatic narrative planning, and long-range coherence. Our architecture combines a Multimodal Large Language Model (MLLM) for high-level narrative planning with a novel fine-grained control module featuring a dynamic Memory Bank that prevents visual drift. To enable practical deployment, we develop a progressive, multi-stage training strategy that efficiently leverages existing pre-trained models, achieving state-of-the-art performance even with limited training data. Recognizing the absence of suitable evaluation benchmarks, we construct and release the E-commerce Advertising Video Storyboard Dataset (EAVSD)—the first comprehensive dataset for this task, containing over 330K high-quality images with rich narrative annotations. Through extensive experiments across three distinct scenarios (controllable multi-scene generation, autonomous storytelling, and e-commerce advertising), we demonstrate our method’s superiority while opening new possibilities for AI-driven content creation.