Ani3DHuman: Photorealistic 3D Human Animation with Self-guided Stochastic Sampling
Abstract
Current 3D human animation methods fall short of photorealism: kinematics-based approaches cannot capture non-rigid dynamics such as clothing deformation, while methods that reconstruct from generated videos suffer from visual artifacts and identity loss. To overcome these limitations, we present Ani3DHuman, a framework that marries kinematics-based animation with video diffusion priors. We first introduce a layered motion representation that disentangles rigid motion from residual non-rigid motion. We then use a pretrained video diffusion model to restore the coarse renderings produced by the mesh-rigged animation; the restored frames in turn supervise the motion field. This restoration task is highly challenging, however: the initial renderings are out-of-distribution, causing standard deterministic ODE samplers to fail. Our core technical contribution is therefore self-guided stochastic sampling, which resolves the out-of-distribution problem by combining stochastic sampling (for photorealistic quality) with self-guidance (for identity fidelity). The restored videos provide high-quality supervision, enabling the optimization of a realistic 4D motion field. Ani3DHuman achieves state-of-the-art results, and our ablations confirm that both components of our sampler are essential for high-fidelity restoration.
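To make the two ingredients of the sampler concrete, the sketch below illustrates one plausible reading of the abstract: an SDEdit-style stochastic (ancestral) sampler that starts from a partially noised coarse rendering and injects fresh noise at every step (unlike a deterministic ODE/DDIM walk), combined with a classifier-free-guidance-style self-guidance term that contrasts predictions conditioned on the coarse rendering against unconditional ones to preserve identity. The denoiser interface `eps_model(x, t, cond=...)`, the linear noise schedule, and the guidance formulation are assumptions for illustration, not the paper's actual API.

```python
import torch


@torch.no_grad()
def self_guided_stochastic_sampling(eps_model, coarse_latent, *,
                                    num_train_steps=1000, strength=0.6,
                                    self_guidance_scale=1.5, device="cpu"):
    """Hypothetical sketch: restore an out-of-distribution coarse rendering.

    eps_model     -- assumed noise-prediction network eps_model(x, t, cond=...)
    coarse_latent -- latents of the coarse mesh-rigged rendering (identity anchor)
    strength      -- fraction of the schedule to traverse; partial noising keeps
                     the rigid structure while letting artifacts be rewritten
    """
    # Linear beta schedule (a common DDPM default; an assumption here).
    betas = torch.linspace(1e-4, 2e-2, num_train_steps, device=device)
    alphas = 1.0 - betas
    abar = torch.cumprod(alphas, dim=0)

    # SDEdit-style start: partially noise the coarse rendering.
    t0 = int(strength * num_train_steps) - 1
    x = (abar[t0].sqrt() * coarse_latent
         + (1 - abar[t0]).sqrt() * torch.randn_like(coarse_latent))

    for t in range(t0, -1, -1):
        tt = torch.full((x.shape[0],), t, device=device, dtype=torch.long)

        # Self-guidance: contrast the prediction conditioned on the coarse
        # rendering with the unconditional one, CFG-style, so each step is
        # steered toward the subject's identity.
        eps_cond = eps_model(x, tt, cond=coarse_latent)
        eps_uncond = eps_model(x, tt, cond=None)
        eps = eps_uncond + self_guidance_scale * (eps_cond - eps_uncond)

        # Ancestral (stochastic) DDPM step: fresh noise is injected at every
        # step, which is what a deterministic ODE sampler omits.
        mean = (x - betas[t] / (1 - abar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            var = betas[t] * (1 - abar[t - 1]) / (1 - abar[t])
            x = mean + var.sqrt() * torch.randn_like(x)
        else:
            x = mean
    return x
```

Under this reading, the ablations in the abstract map directly onto the code: removing the noise injection recovers a deterministic sampler that fails on out-of-distribution renderings, while dropping the guidance term sacrifices identity fidelity.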