Δynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos
Chia-Hsiang Kao ⋅ Cong Phuoc Huynh ⋅ Chien-Yi Wang ⋅ Noranart Vesdapunt ⋅ Stefan Stojanov ⋅ Bharath Hariharan ⋅ Oleksandr Obiednikov ⋅ Ning Zhou
Abstract
Inferring rigid-body physical states and properties from monocular videos is a fundamental step toward physics-based perception and simulation. Existing approaches assume specific underlying physical systems, object types, and camera poses, assumptions that prevent generalization to complex real-world settings. We introduce Δynamics, a vision-language framework that uses language as a unified representation of rigid-body dynamics. Instead of directly predicting physical parameters, Δynamics generates scene configurations in a structured text format for physics simulation. We enhance the model's generalization by integrating natural language motion reasoning and by leveraging optical flow as a semantics-agnostic input. On the CLEVRER dataset, Δynamics achieves a segmentation IoU of $0.30$, a $7\times$ improvement over leading VLMs (InternVL3-8B, Qwen2.5-VL-7B, and Claude-4-Sonnet). Test-time sampling and evolutionary search further boost segmentation IoU by 27% and 120%, respectively. Finally, we demonstrate strong transfer to a new dataset of $235$ real-world rigid-body videos, highlighting the potential of language-driven physics inference for bridging perception and simulation.