Skip to yearly menu bar Skip to main content


Semantics-aware Motion Retargeting with Vision-Language Models

Haodong Zhang · ZhiKe Chen · Haocheng Xu · Lei Hao · Xiaofei Wu · Songcen Xu · Zhensong Zhang · Yue Wang · Rong Xiong

Arch 4A-E Poster #192
[ ]
Wed 19 Jun 10:30 a.m. PDT — noon PDT


Capturing and preserving motion semantics is essential to motion retargeting between animation characters. However, most of the previous works neglect the semantic information or rely on human-designed joint-level representations. Here, we present a novel Semantics-aware Motion reTargeting (SMT) method with the advantage of vision-language models to extract and maintain meaningful motion semantics. We utilize a differentiable module to render 3D motions. Then the high-level motion semantics are incorporated into the motion retargeting process by feeding the vision-language model with the rendered images and aligning the extracted semantic embeddings. To ensure the preservation of fine-grained motion details and high-level semantics, we adopt a two-stage pipeline consisting of skeleton-aware pre-training and fine-tuning with semantics and geometry constraints. Experimental results show the effectiveness of the proposed method in producing high-quality motion retargeting results while accurately preserving motion semantics. Project page can be found at

Live content is unavailable. Log in and register to view live content