Soul: Breathe Life into Digital Human for High-fidelity Long-term Multimodal Animation
Jiangning Zhang ⋅ Junwei Zhu ⋅ Zhenye Gan ⋅ Donghao Luo ⋅ Chuming Lin ⋅ FeiFan Xu ⋅ Xu Peng ⋅ Jianlong Hu ⋅ Yuansen Liu ⋅ Yijia Hong ⋅ Weijian Cao ⋅ Han Feng ⋅ Xu Chen ⋅ Chencan Fu ⋅ Keke He ⋅ Xiaobin Hu ⋅ Chengjie Wang
Abstract
We propose Soul, a multimodal-driven framework for high-fidelity, long-term digital human animation that generates semantically coherent videos from a single-frame portrait image, text prompts, and audio, achieving precise lip synchronization, vivid facial expressions, and robust identity preservation. To mitigate data scarcity, we construct Soul-1M, a dataset of 1 million finely annotated samples produced by a precise automated annotation pipeline and covering portrait, upper-body, full-body, and multi-person scenes, and we carefully curate Soul-Bench for comprehensive and fair evaluation of audio- and text-guided animation methods. The model is built on the Wan2.2-5B backbone, integrating audio-injection layers and multiple training strategies, together with threshold-aware codebook replacement to ensure long-term generation consistency. Meanwhile, step/CFG distillation and a lightweight VAE optimize inference efficiency, achieving an 11.4$\times$ speedup with negligible quality loss. Extensive experiments show that Soul significantly outperforms current leading open-source and commercial models in video quality, video–text alignment, identity preservation, and lip-synchronization accuracy, demonstrating broad applicability in real-world scenarios such as virtual anchors and film production.
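To make the notion of "audio-injection layers" more concrete, below is a minimal sketch of how audio features are commonly injected into a video diffusion backbone via cross-attention. All names, dimensions, and the module structure here are illustrative assumptions for exposition only, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class AudioInjectionLayer(nn.Module):
    """Hypothetical cross-attention block that injects audio features into
    video latent tokens. Shapes and names are assumptions, not Soul's design."""
    def __init__(self, latent_dim=1024, audio_dim=768, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(latent_dim)
        self.audio_proj = nn.Linear(audio_dim, latent_dim)
        self.cross_attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)

    def forward(self, video_tokens, audio_tokens):
        # video_tokens: (B, N_video, latent_dim); audio_tokens: (B, N_audio, audio_dim)
        audio_kv = self.audio_proj(audio_tokens)
        attn_out, _ = self.cross_attn(self.norm(video_tokens), audio_kv, audio_kv)
        # Residual connection leaves the backbone's original pathway intact.
        return video_tokens + attn_out

# Usage with dummy tensors: flattened spatio-temporal latents and per-frame audio embeddings.
layer = AudioInjectionLayer()
video = torch.randn(2, 256, 1024)
audio = torch.randn(2, 50, 768)
out = layer(video, audio)  # (2, 256, 1024)
```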