Plan, Imagine, then Act: Steering Your VLA with Efficient Visually Grounded Planning
Zhuoyang Zhang ⋅ Shang Yang ⋅ Qinghao Hu ⋅ Luke Huang ⋅ James Hou ⋅ Yufei Sun ⋅ Yao Lu ⋅ Song Han
Abstract
Vision-Language-Action (VLA) models convert abstract language instructions into concrete, executable actions, a task that is especially challenging in open-world environments. We present \textit{Visually Grounded Planning}, a general and efficient high-level planner that guides a VLA step-by-step using imagined future observations and subtask descriptions. With an imagined future observation, the VLA can focus on visuomotor inference rather than high-level semantic reasoning, leading to improved accuracy and generalization. Our planner comprises a highly efficient foresight image-generation module that predicts a high-quality 640×480 future observation from the current visual input and language instruction within only 0.33 s on an H100 GPU, together with a vision–language component that reasons over the task and produces subtask descriptions for both the generator and the VLA. Importantly, state-of-the-art VLAs can integrate our planner seamlessly by simply augmenting their visual inputs, without any architectural modification. The foresight generator is pretrained on approximately 10 million multi-task, cross-embodiment samples, enabling it to learn robust embodied dynamics and achieve strong real-world generalization. We evaluate our framework on a benchmark consisting of 11 diverse, multi-step real-world tasks. It achieves an average success rate of 87.4\%, a +40.9\% absolute improvement over the $\pi_0$ baseline (46.5\%) and a +30.3\% absolute improvement over $\pi_0$ augmented with textual subtask guidance (57.1\%).
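To make the plan-imagine-act pipeline described above concrete, the following is a minimal sketch of how the planner could steer an unmodified VLA by augmenting its visual inputs. All module names and interfaces (\texttt{SubtaskReasoner}, \texttt{ForesightGenerator}, \texttt{vla\_policy.act}) are hypothetical illustrations, not the authors' actual API.

\begin{verbatim}
# Hypothetical sketch of the plan-imagine-act loop; class and method names
# are illustrative assumptions, not the paper's released interface.
from dataclasses import dataclass
from typing import Any


@dataclass
class PlannerOutput:
    subtask: str          # natural-language subtask description
    foresight_image: Any  # imagined 640x480 future observation


class VisuallyGroundedPlanner:
    """High-level planner: reason over the task, then imagine the future view."""

    def __init__(self, reasoner, generator):
        self.reasoner = reasoner    # vision-language model producing subtasks
        self.generator = generator  # foresight image generator (~0.33 s / image)

    def plan(self, observation, instruction: str) -> PlannerOutput:
        subtask = self.reasoner.next_subtask(observation, instruction)
        foresight = self.generator.imagine(observation, subtask)
        return PlannerOutput(subtask=subtask, foresight_image=foresight)


def run_episode(planner, vla_policy, env, instruction: str, max_steps: int = 500):
    """Steer an off-the-shelf VLA by appending the imagined future observation
    to its visual inputs; the VLA architecture itself is unchanged."""
    obs = env.reset()
    for _ in range(max_steps):
        guidance = planner.plan(obs, instruction)
        # The VLA sees the current view plus the imagined goal view, so it can
        # focus on visuomotor inference rather than high-level reasoning.
        action = vla_policy.act(
            images=[obs, guidance.foresight_image],
            language=guidance.subtask,
        )
        obs, done = env.step(action)
        if done:
            break
\end{verbatim}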