CLEP: Contrastive Language-Pose Pretraining
Abstract
Aligning natural language descriptions with precise 3D human poses remains a significant challenge, owing to the scarcity of both effective pose representation mechanisms and large-scale, semantically rich datasets. To overcome these limitations, we first introduce CLEP-2M, the largest 3D pose-language dataset to date, comprising two million high-quality 3D pose-language pairs. It offers a 20-fold increase in scale over existing benchmarks, along with far richer semantic diversity. Second, we propose CLEP, a novel contrastive pretraining framework. The core of CLEP is HierFormer, a hierarchical pose encoder specifically designed for language alignment. Its key innovation is a Cross-Scale Attention Fusion (CSAF) mechanism that dynamically integrates features from the joint, limb, and body levels, enabling CLEP to precisely align complex, multi-scale text descriptions with the pose representation. Extensive experiments on CLEP-2M and PoseScript demonstrate that our method consistently outperforms existing approaches across a range of downstream tasks. CLEP exhibits exceptional zero-shot generalization, achieving 34.8 mRecall on the human-annotated PoseScript-H benchmark, a nearly 6-fold improvement over the baseline. Furthermore, CLEP demonstrates superior performance on pose generation and fine-grained pose editing. These results establish CLEP as a strong multimodal foundation model for human-centric understanding and generation tasks.
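The abstract does not specify the internals of CSAF; as a rough intuition, a cross-scale attention fusion step can be sketched as attention-weighted pooling over the three scale-level features. The choice of a body-level query and the projection matrices `w_q`, `w_k` below are illustrative assumptions, not the paper's actual design:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_scale_attention_fusion(joint_feat, limb_feat, body_feat, w_q, w_k):
    """Fuse joint-, limb-, and body-level features via attention.

    Hypothetical sketch: the body-level feature forms a query that
    scores each scale; the fused output is the attention-weighted
    sum of the three scale-level features.
    """
    scales = np.stack([joint_feat, limb_feat, body_feat])  # (3, d)
    q = body_feat @ w_q                                    # (d,)
    k = scales @ w_k                                       # (3, d)
    attn = softmax(k @ q / np.sqrt(len(q)))                # (3,) weights over scales
    return attn @ scales                                   # (d,) fused representation
```

A real implementation would learn `w_q` and `w_k` end-to-end and likely use multi-head attention over token sequences rather than single pooled vectors per scale.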