Gloria: Consistent Character Video Generation via Content Anchors
Yuhang Yang ⋅ Fan Zhang ⋅ Huaijin Pi ⋅ Ailing Zeng ⋅ Shuai Guo ⋅ Guowei Xu ⋅ Wei Zhai ⋅ Yang Cao ⋅ Zheng-Jun Zha
Abstract
Digital characters are central to modern media, yet generating character videos with long duration, consistent multi-view appearance, and expressive identity remains challenging. Existing approaches either provide insufficient context to preserve identity or rely on non-character-centric information as the "memory", leading to suboptimal consistency. Recognizing that character video generation is inherently an "outside-looking-in" scenario, we propose representing the character's visual attributes through a compact set of anchor frames. This design provides stable references for consistency, but reference-based video generation faces two inherent challenges: copy-pasting and multi-reference conflicts. To address these, we introduce two mechanisms: Superset Content Anchoring, which provides intra- and extra-training-clip cues to prevent duplication, and RoPE as Weak Condition, which encodes positional offsets to distinguish multiple anchors. Furthermore, we construct a scalable pipeline to extract these anchors from massive video collections. Experiments show that our method generates high-quality character videos exceeding $10$ minutes and achieves expressive identity and appearance consistency across views, surpassing existing methods.
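The "RoPE as Weak Condition" mechanism can be pictured as assigning each anchor frame a distinct rotary-position offset so attention can tell anchors apart without treating them as adjacent frames of the generated clip. The sketch below is illustrative only: the offset values, shapes, and the `rope` helper are assumptions for exposition, not the paper's implementation.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply standard rotary position embedding (RoPE).

    x: (seq_len, dim) features with even dim; positions: (seq_len,) offsets.
    Each half-pair of channels is rotated by position * per-channel frequency.
    """
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)          # (half,)
    angles = positions[:, None] * freqs[None, :]       # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# Hypothetical layout: the generated clip occupies positions 0..T-1, while
# each anchor frame sits at its own large, well-separated offset, acting as
# a weak positional condition that distinguishes multiple references.
T, dim = 8, 16
clip = rope(np.random.randn(T, dim), np.arange(T))
anchor_offsets = [10_000, 20_000, 30_000]   # one distinct offset per anchor
anchors = [rope(np.random.randn(1, dim), np.array([off]))
           for off in anchor_offsets]
```

Placing anchors at widely spaced offsets keeps their relative rotations distinct, so the model need not resolve conflicts between references that would otherwise share the same positional signature.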