Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World
Yuzhi Huang ⋅ Kairun Wen ⋅ Rongxin Gao ⋅ Dongxuan Liu ⋅ Yibin Lou ⋅ Jie Wu ⋅ Jing Xu ⋅ Jian Zhang ⋅ Zheng Yang ⋅ Yunlong Lin ⋅ Chenxin Li ⋅ Panwang Pan ⋅ Junbin Lu ⋅ Jingyan Jiang ⋅ Xinghao Ding ⋅ Yue Huang ⋅ Zhi Wang
Abstract
Humans inhabit a physical $\textbf{4D world}$, where spatial geometry and semantic content evolve over time, forming a dynamic reality. While current Multimodal Large Language Models (MLLMs) demonstrate strong capabilities in understanding static visual inputs, it remains unclear whether they can effectively $\textbf{"think in dynamics"}$, $\textit{i.e.}$, $\textbf{perceive, track, and reason}$ about spatio-temporal evolution in complex scenes. To systematically evaluate these abilities, we introduce $\texttt{Dyn-Bench}$, a large-scale benchmark designed to assess spatio-temporal reasoning and localized dynamics perception. Constructed through multi-stage filtering over massive 2D and 4D data sources, $\texttt{Dyn-Bench}$ provides a high-quality collection of diverse dynamic scenes, consisting of $\textbf{1k videos}$, $\textbf{7k visual question answering (VQA) pairs}$, and $\textbf{3k dynamic object grounding samples}$. We comprehensively study general-purpose, spatial-aware, and region-level MLLMs to understand how they "think in dynamics" from both linguistic and visual perspectives. Our results reveal that existing models struggle to jointly excel at both $\textbf{spatio-temporal reasoning}$ and $\textbf{dynamic object grounding}$, often producing inconsistent interpretations of motion and interaction. Conventional prompting strategies ($\textit{i.e.}$, chain-of-thought or caption-based hints) provide only limited improvements. In contrast, structured integration approaches, including $\textbf{Mask-Guided Fusion}$ and the $\textbf{Spatio-Temporal Textual Cognitive Map (ST-TCM)}$, substantially enhance MLLMs' dynamic perception and spatio-temporal reasoning in an evolving $\textbf{4D world}$. These findings underscore the importance of explicit spatio-temporal structural cues in bridging the gap between static perception and dynamic reasoning in MLLMs.