AnyID: Ultra-Fidelity Universal Identity-Preserving Video Generation from Any Visual References
Abstract
Identity-preserving video generation offers powerful tools for creative expression, allowing users to customize videos featuring their beloved characters. However, prevailing methods are typically designed and optimized for a single identity reference. This underlying assumption introduces two significant limitations: it curtails creative flexibility by poorly accommodating diverse, real-world input formats, and, more critically, it compromises identity fidelity. Relying on a single source is an ill-posed setting that provides an inherently ambiguous foundation, making it difficult for the model to faithfully reproduce an identity across novel contexts. In response, we present AnyID, an ultra-fidelity identity-preserving video generation framework. Our approach makes two core contributions. First, we introduce a scalable omni-referenced architecture that effectively unifies heterogeneous identity inputs (e.g., faces, portraits, and videos) into a cohesive representation. Second, we propose a primary-referenced generation paradigm, which designates one reference as a canonical anchor and uses a novel differential prompt to enable precise, attribute-level controllability. The model is trained on a large-scale, meticulously curated dataset to ensure robustness and high fidelity. In addition, we perform a final fine-tuning stage using reinforcement learning, which leverages a preference dataset constructed from human evaluations in which annotators performed pairwise comparisons of videos on two key criteria: identity fidelity and prompt controllability. Extensive evaluations validate that AnyID achieves ultra-high identity fidelity as well as superior attribute-level controllability across diverse task settings. All code, data, and models will be publicly released.