RigMo: Unifying Rig and Motion Learning for Generative Animation
Abstract
Recent progress in 4D generation has advanced the reconstruction of dynamic geometry, yet the modeling of rig and motion, the two core elements of animation, remains disconnected. Existing approaches typically treat rigging and motion generation as independent tasks: auto-rigging methods rely on human-annotated skeletons and skinning weights, while motion-generation models predict dense vertex trajectories without any explicit structure. This separation contradicts the nature of animation itself, which is the coupled outcome of structure and motion, and it limits scalability, interpretability, and control. We present RigMo, a unified generative framework that jointly learns rig and motion directly from raw mesh sequences, without rig annotations or human priors. RigMo encodes per-vertex deformations into a compact latent space and decodes a set of implicit Gaussian bones, skinning weights, and time-varying transformations that together define an animatable mesh. This design makes the model animatable by construction: a single latent representation yields both an explicit rig structure and temporally coherent motion parameters. Unlike optimization-based auto-rigging methods that overfit to a specific sequence, RigMo generalizes across object categories and motion styles, offering feed-forward inference for arbitrary deformable objects. Experiments on DeformingThings4D, Objaverse-XL, and diverse human and animal datasets demonstrate that RigMo generates smooth, interpretable, and physically consistent rigs, achieving superior reconstruction and generalization compared to existing 4D generative baselines. RigMo establishes a new paradigm for structure-aware, controllable, and scalable 4D generation.
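The abstract does not spell out the deformation model, but the pairing of decoded skinning weights with time-varying bone transformations suggests a linear-blend-skinning readout; the following formulation is an illustrative sketch under that assumption, with notation of our own choosing rather than the paper's. Each rest-pose vertex \(\bar{v}_i\) would be deformed at time \(t\) as

\[
\hat{v}_i(t) \;=\; \sum_{b=1}^{B} w_{ib}\, \mathbf{T}_b(t)\, \bar{v}_i,
\qquad
\sum_{b=1}^{B} w_{ib} = 1,\;\; w_{ib} \ge 0,
\]

where \(B\) is the number of implicit Gaussian bones, \(w_{ib}\) is the skinning weight binding vertex \(i\) to bone \(b\), and \(\mathbf{T}_b(t) \in SE(3)\) is the transformation of bone \(b\) at frame \(t\), with all three quantities decoded from the single shared latent. Under this reading, "animatable by construction" means new motion can be synthesized by varying \(\mathbf{T}_b(t)\) alone while the rig structure \((B, w_{ib})\) stays fixed.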