Poster

Scaling Up Dynamic Human-Scene Interaction Modeling

Nan Jiang ⋅ Zhiyuan Zhang ⋅ Hongjie Li ⋅ Xiaoxuan Ma ⋅ Zan Wang ⋅ Yixin Chen ⋅ Tengyu Liu ⋅ Yixin Zhu ⋅ Siyuan Huang

Highlight

2024 Poster


Abstract

Progress in human-scene interaction modeling is held back by the scarcity of both high-quality data and advanced motion synthesis methods. Previous efforts have not produced datasets that address the dual challenges of scalability and data quality. In this work, we overcome these challenges by introducing TRUMANS, a large-scale motion capture (MoCap) dataset created by efficiently and precisely replicating synthetic scenes in the physical environment. TRUMANS, the most extensive motion-captured human-scene interaction dataset thus far, comprises over 15 hours of diverse human behaviors, including concurrent interactions with dynamic and articulated objects, across 100 indoor scene configurations. It provides accurate pose sequences of both humans and objects, ensuring a high level of contact plausibility during the interaction. We also propose a data augmentation approach that automatically adapts collision-free and interaction-precise human motions, significantly increasing the diversity of both interacting objects and scene backgrounds. Leveraging TRUMANS, we propose a novel approach that employs a diffusion-based autoregressive mechanism for the real-time generation of human-scene interaction sequences of arbitrary length. The efficacy of TRUMANS and our motion synthesis method is validated through extensive experiments, in which our method surpasses all existing baselines in terms of quality and diversity. Notably, our method demonstrates strong zero-shot generalizability on existing 3D scene datasets, e.g., PROX, Replica, ScanNet, and ScanNet++, and is capable of generating even more realistic motions than the ground-truth annotations on PROX. Our human study further indicates that our generated motions are almost indistinguishable from the original motion-captured sequences, highlighting their quality. Our dataset and model will be released for research purposes.
