Action-Sketcher: From Reasoning to Action via Visual Sketches for Robotic Manipulation
Huajie Tan ⋅ Peterson Co ⋅ Yijie Xu ⋅ Shanyu Rong ⋅ Yuheng Ji ⋅ Cheng Chi ⋅ Xiansheng Chen ⋅ Zhongxia Zhao ⋅ Pengwei Wang ⋅ Zhongyuan Wang ⋅ Shanghang Zhang
Abstract
Long-horizon, open-world robotic manipulation is increasingly important for real-world deployment, requiring spatial disambiguation in complex layouts and temporal resilience under dynamic interaction. However, existing end-to-end and hierarchical Vision–Language–Action (VLA) policies often rely on text-only cues while keeping plan intent latent, which undermines \textit{referential grounding} in cluttered or underspecified scenes, impedes effective \textit{task decomposition} of long-horizon goals under closed-loop interaction, and limits \textit{causal explanation} by obscuring the rationale behind action choices. To address these issues, we first introduce \textbf{Visual Sketch}, a lightweight visual intermediate that renders points, boxes, arrows, and typed relations on the robot’s current views to externalize spatial intent, bind language to scene geometry, and provide a human-verifiable bridge between high-level reasoning and low-level control. Building on \textit{Visual Sketch}, we present \textbf{Action-Sketcher}, a VLA framework that operates in a cyclic \textit{See $\rightarrow$ Think $\rightarrow$ Sketch $\rightarrow$ Act} workflow, coordinated by an adaptive token-gated strategy that schedules reasoning triggers, sketch revision, and action issuance, thereby supporting reactive corrections and human interaction while preserving real-time action prediction. To enable scalable training and evaluation, we curate a 2.3M-sample corpus of interleaved images, text, \textit{Visual Sketch} supervision, and action sequences, and train \textit{Action-Sketcher} with a multi-stage curriculum that combines interleaved sequence alignment for modality unification, language-to-sketch consistency for precise linguistic grounding, and imitation learning augmented with sketch-to-action reinforcement for robustness. Experiments on cluttered tabletops and multi-object tasks, both in simulation and on real robots, show improved long-horizon success, stronger robustness to dynamic scene changes, and enhanced interpretability via editable sketches and step-wise plans.
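For intuition, the following minimal Python sketch shows one way the token-gated \textit{See $\rightarrow$ Think $\rightarrow$ Sketch $\rightarrow$ Act} cycle described above could be organized. All identifiers (`SketcherPolicy`, `DummyEnv`, the `<think>`/`<act>` control tokens) and the fixed re-planning heuristic are illustrative assumptions, not the paper's implementation, in which gating is learned and actions are predicted by the VLA model.

```python
# Minimal, hypothetical sketch of a cyclic See -> Think -> Sketch -> Act
# workflow with adaptive token gating. All names and the re-plan heuristic
# are illustrative assumptions, not the paper's actual method.

from dataclasses import dataclass

THINK, ACT = "<think>", "<act>"  # hypothetical control tokens emitted by the policy


@dataclass
class SketcherPolicy:
    """Stand-in for a VLA model; real gating would be learned, not scheduled."""
    step: int = 0

    def gate(self, obs) -> str:
        # Decide whether to re-plan (Think/Sketch) or keep acting.
        self.step += 1
        return THINK if self.step % 5 == 1 else ACT  # toy fixed heuristic

    def think(self, obs) -> str:
        return f"subgoal at step {self.step}"  # textual reasoning step

    def sketch(self, obs, plan: str) -> dict:
        # Visual Sketch: points, boxes, arrows, typed relations on the current view.
        return {"points": [(0.42, 0.61)],
                "arrows": [((0.42, 0.61), (0.70, 0.48))],
                "relation": "place-on",
                "plan": plan}

    def act(self, obs, sketch: dict | None) -> list[float]:
        return [0.0] * 7  # placeholder 7-DoF action


class DummyEnv:
    """Stand-in environment; replace with a real simulator or robot driver."""
    def reset(self) -> dict:
        return {"rgb": None}

    def step(self, action) -> tuple[dict, bool]:
        return {"rgb": None}, False


def control_loop(policy: SketcherPolicy, env: DummyEnv, max_steps: int = 20) -> None:
    obs = env.reset()                                  # See (initial observation)
    current_sketch = None
    for _ in range(max_steps):
        if policy.gate(obs) == THINK:                  # reasoning trigger
            plan = policy.think(obs)                   # Think
            current_sketch = policy.sketch(obs, plan)  # Sketch (draw or revise)
        action = policy.act(obs, current_sketch)       # Act
        obs, done = env.step(action)                   # See (next observation)
        if done:
            break


if __name__ == "__main__":
    control_loop(SketcherPolicy(), DummyEnv())
```

The structural point this sketch tries to capture is that the sketch is an explicit, inspectable intermediate between reasoning and control: `current_sketch` persists across action steps and can be revised (by the model or, in principle, a human) between steps without interrupting real-time action prediction.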