EvoID: Reinforced Evolution for Identity-Preserving Video Generation
Abstract
We present EvoID, a novel framework that reformulates Identity-Preserving Video Generation as a self-evolving process driven by Reinforcement Learning. Moving beyond the static paradigm of imitation learning, EvoID enables a generative model to actively learn and optimize the complex trade-offs among identity fidelity, motion naturalness, and temporal coherence. At the heart of EvoID is a dynamic, dual-path reward mechanism, which acts as an intrinsic critic by adaptively combining objective metrics with MLLM-based holistic quality assessment. This allows the model to "evolve" its generation strategy, focusing on different aspects of quality at different stages of training. To ensure stable and coherent evolution, we anchor the exploring Student model to a frozen Teacher, preserving robust world priors while allowing creative refinement during video generation. Extensive experiments demonstrate the superiority of our approach: EvoID achieves a total score of 0.687 on the Human Domain of the OpenS2V-Eval benchmark, surpassing the 0.658 of the open-source VACE and the 0.653 of the commercial Hailuo. Moreover, EvoID sets a new record of 0.718 on our newly introduced MLLM-based metric, which prioritizes human perception and more comprehensively reflects video quality.