SemanticVLA: Towards Semantic Reasoning over Action Memorization via Synergistic Explicit Trace and Latent Action Planning
Abstract
Vision-Language-Action (VLA) models have emerged as a promising paradigm in which pretrained Vision-Language Models (VLMs) serve as System 2 for high-level reasoning, connected to action experts that serve as System 1 for low-level motor control. However, current works fail to genuinely leverage VLM capabilities: VLMs produce latent embeddings that lack semantic interpretability and therefore provide ambiguous and unstable guidance to downstream policies, while supervision from actions alone further causes VLMs to degenerate into parameter-heavy fusion encoders that memorize action patterns rather than perform generalizable reasoning. To bridge this gap, we introduce SemanticVLA, which leverages VLM reasoning through a synergistic dual-path design. Explicit trace reasoning generates interpretable spatial waypoints as textual coordinate sequences through the VLM's native language interface, directly reusing its pretrained spatial grounding to provide a "thinking process" for task planning. Latent action tokens complement trace reasoning by learning compact visuomotor primitives grounded in visual observations, providing finer-grained action representations than pure coordinate prediction. This synergy enables trace reasoning to leverage the VLM's multimodal understanding to refine latent token prediction, while latent tokens provide stable and grounded guidance that compensates for the numerical sensitivity of the trace. SemanticVLA achieves a 97.0% average success rate on LIBERO and 65.1% on SimplerEnv WidowX, substantially outperforming strong baselines. More importantly, SemanticVLA maintains significantly more stable performance under instruction rephrasing in both simulation suites, and demonstrates strong advantages on real-world long-horizon and reasoning-intensive tasks. By bridging VLM reasoning and the action expert through semantically explicit traces and visually grounded latent action tokens, our approach enables genuine reasoning rather than action memorization.
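To make the idea of a trace expressed as a textual coordinate sequence concrete, the following is a minimal sketch of how such text could be turned back into waypoints. The abstract does not specify the serialization, so the "(x, y)" pixel format, the function name parse_trace, and the clamping step are all assumptions made for illustration, not the authors' implementation.

```python
import re
from typing import List, Tuple

# Hypothetical text format for a trace: "(x, y)" integer pairs in pixel
# coordinates. The actual serialization used by SemanticVLA is not given
# in the abstract; this pattern is an assumption for illustration only.
_PAIR = re.compile(r"\(\s*(-?\d+)\s*,\s*(-?\d+)\s*\)")


def parse_trace(text: str, width: int, height: int) -> List[Tuple[int, int]]:
    """Extract integer waypoints from generated text and clamp them to the image."""
    waypoints = []
    for x_str, y_str in _PAIR.findall(text):
        # Clamping guards against out-of-range numbers in the generated text.
        x = min(max(int(x_str), 0), width - 1)
        y = min(max(int(y_str), 0), height - 1)
        waypoints.append((x, y))
    return waypoints


trace = parse_trace("[(120, 88), (134, 92), (300, 101)]", width=256, height=256)
print(trace)  # [(120, 88), (134, 92), (255, 101)]
```

In this sketch a single wrong digit in the text moves a waypoint by many pixels, which is one way to read the numerical sensitivity that the latent action tokens are said to compensate for.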