LAMP: Language-Assisted Motion Planning for Controllable Video Generation
Abstract
Recent advances in video generation have achieved remarkable progress in visual fidelity and controllability, enabling conditioning not only on text but also on structural layout and motion signals. Among these, motion control (i.e., specifying both object dynamics and camera trajectories) is particularly critical for directing complex, cinematic scenes, yet existing interfaces remain limited. To address this gap, we introduce LAMP, a framework that leverages large language models~(LLMs) as motion planners, translating natural language descriptions into explicit 3D trajectories for both dynamic objects and cameras (the latter defined relative to the objects). Specifically, we fine-tune an LLM to generate frame-wise 3D bounding-box trajectories for objects and, conditioned on these, to produce corresponding 3D camera paths, which are then converted into generator-compatible 2D control signals. To enable this, we construct a large-scale paired dataset by combining procedurally generated text–trajectory pairs with real video datasets augmented with 3D annotations. Experiments demonstrate improved controllability and alignment with user intent compared to state-of-the-art alternatives, establishing the first framework for joint object–camera trajectory generation directly from natural language.
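To make the final conversion step concrete, the sketch below shows one plausible way to turn frame-wise 3D bounding-box trajectories and per-frame camera poses into 2D box control signals under a standard pinhole camera model. This is a minimal illustration, not the paper's actual implementation; all names (project_boxes_to_2d, K, extrinsics) are hypothetical.

```python
import numpy as np

def project_boxes_to_2d(box_centers_3d, box_sizes, K, extrinsics):
    """Project frame-wise 3D boxes to axis-aligned 2D boxes (a sketch).

    box_centers_3d: (T, 3) world-frame box centers, one per frame.
    box_sizes:      (T, 3) box extents (width, height, depth) per frame.
    K:              (3, 3) camera intrinsics (pinhole model assumed).
    extrinsics:     (T, 4, 4) world-to-camera transforms per frame.
    Returns:        (T, 4) boxes as (x_min, y_min, x_max, y_max) in pixels.
    Assumes all box corners lie in front of the camera (z > 0).
    """
    T = box_centers_3d.shape[0]
    boxes_2d = np.zeros((T, 4))
    # Offsets of the 8 corners of a unit cube, scaled per frame by box extents.
    corners = np.array([[sx, sy, sz]
                        for sx in (-0.5, 0.5)
                        for sy in (-0.5, 0.5)
                        for sz in (-0.5, 0.5)])
    for t in range(T):
        pts_world = box_centers_3d[t] + corners * box_sizes[t]  # (8, 3)
        pts_h = np.hstack([pts_world, np.ones((8, 1))])         # homogeneous
        pts_cam = (extrinsics[t] @ pts_h.T).T[:, :3]            # camera frame
        pix = (K @ pts_cam.T).T                                 # project
        pix = pix[:, :2] / pix[:, 2:3]                          # perspective divide
        # The 2D control signal is the tight axis-aligned box over the corners.
        boxes_2d[t] = [pix[:, 0].min(), pix[:, 1].min(),
                       pix[:, 0].max(), pix[:, 1].max()]
    return boxes_2d
```

Under these assumptions, the resulting per-frame 2D boxes could serve as layout conditioning for an off-the-shelf controllable video generator.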