MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
Abstract
Motion capture now underpins content creation far beyond digital humans, yet most pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation (e.g., BVH) that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware Inverse Kinematics (IK) fitting. MoCapAnything comprises three learnable modules and a lightweight IK stage: a Reference Prompt Encoder that distills per-joint queries from the asset’s skeleton, mesh, and rendered image set; a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the modality gap between RGB tokens and the point-cloud–like joint space; and a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate a dataset of 1,038 motion clips from Truebones Zoo, each providing a standardized skeleton–mesh–rendered-video triad. Experiments on in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits non-trivial cross-species retargeting across heterogeneous rigs, offering a scalable path toward prompt-based 3D motion capture for arbitrary assets.