Recovering Physically Plausible Human-Object Interactions from Monocular Videos
Abstract
In this paper, we present a method to reconstruct physically plausible human-object interactions (HOI) from monocular videos. While existing kinematic-based approaches produce visually plausible motion, they often exhibit physical artifacts such as interpenetration and object floating. To overcome these issues, we introduce a physics-guided reconstruction framework that begins with a kinematic estimate and then refines it through a reinforcement learning (RL) policy trained to reproduce the interaction in a physics simulator. Because kinematic estimates are typically noisy, naively training the RL policy on all frames can fail. We therefore propose an adaptive sampling strategy with a dual self-updating mechanism that automatically identifies the frames with the most informative and reliable kinematic reconstructions. This process progressively improves reconstruction quality and yields physically consistent HOI sequences. We evaluate our approach on two standard benchmarks and achieve clear improvements in physical plausibility metrics over state-of-the-art methods.
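As a rough illustration of the adaptive sampling idea described above, the sketch below weights frames by reliability and informativeness scores, samples reference frames for RL rollouts accordingly, and then self-updates the reliability scores from the policy's tracking error. All names, scores, and update rules here are hypothetical stand-ins, not the paper's actual formulation:

```python
import numpy as np

def sample_frames(reliability, informativeness, n, rng):
    """Sample reference frames for RL rollouts, favoring frames whose
    kinematic estimates are both reliable and informative."""
    weights = reliability * informativeness
    probs = weights / weights.sum()
    return rng.choice(len(weights), size=n, p=probs)

def update_reliability(reliability, frames, tracking_error, lr=0.5):
    """Self-updating step (illustrative): frames the simulated policy
    tracks poorly are down-weighted as likely-noisy kinematic estimates."""
    reliability = reliability.copy()
    reliability[frames] = ((1 - lr) * reliability[frames]
                           + lr * np.exp(-tracking_error))
    return np.clip(reliability, 1e-3, 1.0)

# Toy run: 10 frames, uniform initial scores.
rng = np.random.default_rng(0)
T = 10
reliability = np.ones(T)
informativeness = np.ones(T)
frames = sample_frames(reliability, informativeness, n=4, rng=rng)
# Stand-in for per-frame tracking error returned by the simulator.
err = rng.uniform(0.0, 2.0, size=4)
reliability = update_reliability(reliability, frames, err)
```

In this sketch, the "dual" aspect would correspond to maintaining and updating both scores (reliability and informativeness) as training proceeds; only the reliability update is shown.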