PRISM: Learning a Shared Primitive Space for Transferable Skeleton Action Representation
Abstract
Real-world human action understanding remains challenging due to long-tailed label distributions, compositional motion patterns, and viewpoint variations. Existing skeleton-based methods often lack a structured and transferable representation of motion, and task-specific models for generation, classification, and detection are usually trained independently, resulting in fragmented pipelines and limited cross-task generalization. We present PRISM, a PRImitive-centric Skeleton Modeling framework that learns a shared motion representation from a motion generation objective and transfers it to perception tasks. PRISM represents each action sequence as a trajectory in a primitive coefficient space, which captures how a set of learned atomic motion primitives contributes to the observed motion. A structured decomposition module learns this representation in a physically grounded and view-invariant manner via motion generation. Instead of enforcing joint or unified training across tasks, PRISM provides a single primitive-centric representation that can be sequentially transferred to downstream classification and frame-wise detection through lightweight task heads. This representation introduces structure, compositionality, and improved generalization across tasks with distinct supervision. PRISM consistently improves performance on long-tailed and multi-label datasets and enables interpretable reasoning over compositional and rare actions. Extensive experiments show that the structured primitive space serves as a transferable and robust foundation for diverse action understanding tasks on real-world datasets.
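As a minimal sketch of the primitive coefficient representation described above, assuming each frame is approximated by a linear combination of learned primitives (the symbols below are illustrative and not necessarily the paper's exact formulation): given a pose sequence $x_{1:T}$ and $K$ learned atomic primitives $\{p_k\}_{k=1}^{K}$,
$$x_t \approx \sum_{k=1}^{K} c_{t,k}\, p_k, \qquad t = 1, \dots, T,$$
so the action is summarized by the coefficient trajectory $c_{1:T} \in \mathbb{R}^{T \times K}$, which the lightweight classification and frame-wise detection heads would consume in place of raw joint coordinates.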