Plan, Imagine, then Act: Steering Your VLA with Efficient Visually Grounded Planning
Zhuoyang Zhang ⋅ Shang Yang ⋅ Qinghao Hu ⋅ Luke Huang ⋅ James Hou ⋅ Yufei Sun ⋅ Yao Lu ⋅ Song Han
Abstract
Vision-Language-Action (VLA) models convert abstract language instructions into concrete, executable actions, a task that is especially challenging in open-world environments. We present \textit{Visually Grounded Planning}, a general and efficient high-level planner that guides a VLA step-by-step using imagined future observations and subtask descriptions. With an imagined future observation, the VLA can focus on visuomotor inference rather than high-level semantic reasoning, leading to improved accuracy and generalization. Our planner comprises a highly efficient foresight image-generation module that predicts a high-quality 640×480 future observation from the current visual input and language instruction within only 0.33 s on an H100 GPU, together with a vision–language component that reasons over the task and produces subtask descriptions for both the generator and the VLA. Importantly, state-of-the-art VLAs can integrate our planner seamlessly by simply augmenting their visual inputs, without any architectural modification. The foresight generator is pretrained on approximately 10 million multi-task, cross-embodiment samples, enabling it to learn robust embodied dynamics and achieve strong real-world generalization. We evaluate our framework on a benchmark consisting of 11 diverse, multi-step real-world tasks. It achieves an average success rate of 87.4\%, a +40.9\% absolute improvement over the $\pi_0$ baseline (46.5\%) and a +30.3\% absolute improvement over $\pi_0$ augmented with textual subtask guidance (57.1\%).
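To make the plan-imagine-act pipeline described above concrete, the following is a minimal sketch of how the planner could steer an unmodified VLA by augmenting its visual inputs. All module names and interfaces (\texttt{SubtaskReasoner}, \texttt{ForesightGenerator}, \texttt{vla\_policy.act}) are hypothetical illustrations, not the authors' actual API.

\begin{verbatim}
# Hypothetical sketch of the plan-imagine-act loop; class and method names
# are illustrative assumptions, not the paper's released interface.
from dataclasses import dataclass
from typing import Any


@dataclass
class PlannerOutput:
    subtask: str          # natural-language subtask description
    foresight_image: Any  # imagined 640x480 future observation


class VisuallyGroundedPlanner:
    """High-level planner: reason over the task, then imagine the future view."""

    def __init__(self, reasoner, generator):
        self.reasoner = reasoner    # vision-language model producing subtasks
        self.generator = generator  # foresight image generator (~0.33 s / image)

    def plan(self, observation, instruction: str) -> PlannerOutput:
        subtask = self.reasoner.next_subtask(observation, instruction)
        foresight = self.generator.imagine(observation, subtask)
        return PlannerOutput(subtask=subtask, foresight_image=foresight)


def run_episode(planner, vla_policy, env, instruction: str, max_steps: int = 500):
    """Steer an off-the-shelf VLA by appending the imagined future observation
    to its visual inputs; the VLA architecture itself is unchanged."""
    obs = env.reset()
    for _ in range(max_steps):
        guidance = planner.plan(obs, instruction)
        # The VLA sees the current view plus the imagined goal view, so it can
        # focus on visuomotor inference rather than high-level reasoning.
        action = vla_policy.act(
            images=[obs, guidance.foresight_image],
            language=guidance.subtask,
        )
        obs, done = env.step(action)
        if done:
            break
\end{verbatim}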