MoCha: End-to-End Video Character Replacement without Structural Guidance
Abstract
Controllable video character replacement with a user-provided identity remains a challenging problem due to the lack of paired video data. Prior works have predominantly relied on a reconstruction-based paradigm that requires per-frame segmentation masks and explicit structural guidance (e.g., skeleton, depth). This reliance, however, severely limits their generalizability in complex scenarios involving occlusions, character-object interactions, unusual poses, or challenging illumination, often leading to visual artifacts and temporal inconsistencies. In this paper, we propose MoCha, a pioneering framework that mitigates these limitations by harnessing the inherent tracking ability of the video diffusion model, thereby requiring only a single mask on an arbitrary frame and no structural guidance. To effectively adapt the multi-modal input conditions and enhance facial identity, we introduce a condition-aware RoPE and employ an RL-based post-training stage. Furthermore, to overcome the scarcity of high-quality paired training data, we propose a comprehensive data construction pipeline. Specifically, we design three specialized datasets: a high-fidelity rendered dataset built with Unreal Engine 5 (UE5), an expression-driven dataset synthesized with current portrait animation techniques, and an augmented dataset derived from existing video-mask pairs. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches. We will release the code and dataset to facilitate further research.