Enhancing Vision Language Models for 4D Perception
Abstract
Despite recent advances, Vision Language Models (VLMs) still struggle to grasp the dynamics of the world. We note that the ability to reason about 3D motion, challenging in itself, is further complicated by two factors. First, VLMs observe motion only indirectly, via its projection onto 2D images. Second, existing datasets fail to disentangle object and camera motion. To address these issues, we present a QA generation pipeline focused on motion-related scene understanding. We take particular care to address the entanglement of camera and object motion by casting tracking both in the traditional camera frame and in a novel fixed reference frame, dubbed True-Motion Tracking, which provides an intuitive description of motion. From this pipeline, we generate a large-scale dataset of 400K training samples and a 2.2K-sample benchmark. Training existing models on our dataset yields performance improvements on an external benchmark, validating the effectiveness of our approach.
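Since the abstract names True-Motion Tracking without giving details, the following is a minimal sketch of one natural reading of "a fixed reference system": re-expressing per-frame camera-coordinate track points in a single world frame using camera poses, so that a static object yields a stationary track and camera motion no longer masquerades as object motion. The function name `to_world_frame`, the array shapes, and the pose convention are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def to_world_frame(points_cam, cam_to_world):
    """Re-express camera-frame 3D track points in a fixed world frame.

    points_cam:   (T, 3) array, one tracked point per frame in camera coords.
    cam_to_world: (T, 4, 4) array, per-frame camera-to-world SE(3) poses.
    Returns:      (T, 3) array, the same track in the fixed world frame.
    """
    T = points_cam.shape[0]
    # Homogenize so a single 4x4 matrix applies rotation and translation.
    homog = np.concatenate([points_cam, np.ones((T, 1))], axis=1)  # (T, 4)
    # Apply each frame's pose to its corresponding point.
    world = np.einsum("tij,tj->ti", cam_to_world, homog)           # (T, 4)
    return world[:, :3]

# Example: a static point seen from a translating camera. It appears to
# move in camera coordinates, but is constant in the fixed world frame.
poses = np.stack([np.eye(4) for _ in range(3)])
poses[:, 0, 3] = [0.0, 1.0, 2.0]          # camera translates along x
pt_world = np.array([5.0, 0.0, 10.0])
pts_cam = pt_world - poses[:, :3, 3]      # apparent motion in camera frame
print(to_world_frame(pts_cam, poses))     # each row is [5, 0, 10]
```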