ChatPose: Chatting about 3D Human Pose

Yao Feng · Jing Lin · Sai Kumar Dwivedi · Yu Sun · Priyanka Patel · Michael J. Black

Arch 4A-E Poster #186
[ Project Page ]
Wed 19 Jun 10:30 a.m. PDT — noon PDT


We introduce PoseGPT, a framework employing Large Language Models (LLMs) to understand and reason about 3D human poses from images or textual descriptions. Our work is motivated by the human ability to intuitively understand postures from a single image or a brief description, a process that intertwines image interpretation, world knowledge, and an understanding of body language. Traditional human pose estimation methods, whether image-based or text-based, often lack holistic scene comprehension and nuanced reasoning, leading to a disconnect between visual data and its real-world implications. PoseGPT addresses these limitations by embedding SMPL poses as a distinct signal token within a multimodal LLM, enabling direct generation of 3D body poses from both textual and visual inputs. This approach not only simplifies pose prediction but also empowers LLMs to apply their world knowledge in reasoning about human poses, fostering two advanced tasks: speculative pose generation and reasoning about pose estimation. These tasks involve generating human poses from subtle text queries, possibly accompanied by images, after comprehensive reasoning. We establish benchmarks for these tasks, moving beyond the confines of traditional pose generation and estimation methodologies. Our results show that PoseGPT outperforms existing multimodal LLMs and task-specific methods on these newly proposed tasks. Furthermore, PoseGPT's ability to understand and generate 3D human poses based on complex reasoning opens new directions in human pose analysis. We will release the models and training code for research purposes.
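
The idea of treating an SMPL pose as a signal token can be illustrated with a minimal sketch. It is not the authors' implementation: the token id `POSE_TOKEN_ID`, the helper names `make_projection` and `decode_pose`, the single linear projection, and the toy hidden size are all assumptions made for illustration. The only fixed fact used is that an SMPL pose comprises 24 joint rotations (global orientation plus 23 body joints), here written as axis-angle triples. The sketch finds the pose token in a generated sequence, takes the hidden state at that position, and maps it to 72 pose parameters.

```python
import random

NUM_SMPL_JOINTS = 24              # global orientation + 23 body joints
POSE_DIM = NUM_SMPL_JOINTS * 3    # axis-angle: 3 values per joint
POSE_TOKEN_ID = 32000             # hypothetical id of a pose signal token


def make_projection(hidden_dim, out_dim, seed=0):
    """Build a random (out_dim x hidden_dim) matrix standing in for a learned head."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(hidden_dim)]
            for _ in range(out_dim)]


def decode_pose(token_ids, hidden_states, weight):
    """Return 24 axis-angle triples, or None if no pose token was generated."""
    if POSE_TOKEN_ID not in token_ids:
        return None  # the model answered in plain text only
    idx = token_ids.index(POSE_TOKEN_ID)
    h = hidden_states[idx]  # hidden state at the pose token position
    flat = [sum(w * x for w, x in zip(row, h)) for row in weight]
    return [flat[i:i + 3] for i in range(0, len(flat), 3)]


# Toy stand-ins for an LLM's output: 4 tokens, hidden size 8 (assumed values).
hidden_dim = 8
rng = random.Random(1)
token_ids = [1, 5, POSE_TOKEN_ID, 2]
hidden_states = [[rng.gauss(0.0, 1.0) for _ in range(hidden_dim)]
                 for _ in token_ids]
weight = make_projection(hidden_dim, POSE_DIM)

pose = decode_pose(token_ids, hidden_states, weight)
no_pose = decode_pose([1, 5, 2], hidden_states[:3], weight)
```

In a trained system the projection would be learned jointly with the language model, and the resulting parameters would be passed to an SMPL body model to obtain a mesh; both steps are outside the scope of this sketch.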
