CVPR Poster Human Motion Instruction Tuning


Human Motion Instruction Tuning

Lei Li · Sen Jia · Jianhao Wang · Zhongyu Jiang · Feng Zhou · Ju Dai · Tianfang Zhang · Zongkai Wu · Jenq-Neng Hwang

ExHall D Poster #170


Sat 14 Jun 3 p.m. PDT — 5 p.m. PDT

Abstract: This paper presents $\textbf{LLaMo}$ ($\textbf{L}$arge $\textbf{La}$nguage and Human $\textbf{Mo}$tion Assistant), a multimodal framework for human motion instruction tuning. In contrast to conventional instruction-tuning approaches that convert non-linguistic inputs, such as video or motion sequences, into language tokens, LLaMo retains motion in its native form for instruction tuning. This method preserves motion-specific details that are often diminished in tokenization, thereby improving the model’s ability to interpret complex human behaviors. By processing both video and motion data alongside textual inputs, LLaMo enables a flexible, human-centric analysis. Experimental evaluations across high-complexity domains, including human behaviors and professional activities, indicate that LLaMo effectively captures domain-specific knowledge, enhancing comprehension and prediction in motion-intensive scenarios. We hope LLaMo offers a foundation for future multimodal AI systems with broad applications, from sports analytics to behavioral prediction.
