MoEActok: A MoE-based Action Tokenizer for Vision-Language-Action Models
Abstract
Recent works on vision-language-action (VLA) models have made great progress in exploring action tokenizers that convert continuous control signals into discrete tokens to align with LLM/VLM training paradigms. These approaches typically train a single tokenizer over entire manipulation trajectories, which often comprise multiple distinct skills and thus pose a challenging optimization trade-off. To address this issue, we introduce MoEActok, a novel action tokenizer that employs a mixture-of-experts (MoE) quantizer to produce skill-aware discrete representations for VLA models. MoEActok utilizes a clustering-driven MoE VQ-VAE mechanism in which each expert specializes in a particular skill. The key components are: (a) an action-skill decoupling strategy that uses k-means clustering to group action chunks, so that each cluster corresponds to a similar skill; (b) a skill-aware training paradigm that augments VLA models with skill-conditioned context, improving skill grounding; and (c) an adapter that projects shared encoder representations into skill-specific latent spaces for specialized quantization, and subsequently harmonizes the heterogeneous quantized representations back into a unified space for coherent reconstruction by the shared decoder. We evaluate MoEActok-based VLA models against multiple prior action tokenizer baselines in the RoboTwin and Simpler-Env simulators, and further assess zero-shot transfer on three real-world tasks. Across both simulated and real-world settings, MoEActok-based VLA models substantially outperform existing discrete tokenization methods.
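The clustering-driven, skill-routed quantization described above can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: all names, dimensions, and the use of plain k-means plus nearest-neighbor codebook lookup are assumptions for exposition.

```python
# Hypothetical sketch of skill-aware MoE quantization: action chunks are
# grouped by k-means into "skills", and each skill routes to its own
# expert codebook for vector quantization. Names and sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    # Plain k-means over action-chunk latents: each cluster stands in
    # for one "skill" in the action-skill decoupling step.
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(0)
    return centers, assign

def quantize(z, codebook):
    # Nearest-neighbor vector quantization within one expert's codebook.
    d = ((z[:, None, :] - codebook[None]) ** 2).sum(-1)
    idx = d.argmin(1)
    return codebook[idx], idx

# Toy data: 200 action chunks, 8-dim latents, 3 skills, 16 codes per expert.
X = rng.normal(size=(200, 8))
_, skill = kmeans(X, k=3)
codebooks = [rng.normal(size=(16, 8)) for _ in range(3)]

# Route each chunk to the codebook of its assigned skill (the "expert").
z_q = np.empty_like(X)
for s in range(3):
    mask = skill == s
    z_q[mask], _ = quantize(X[mask], codebooks[s])
```

In the full model, the per-skill latents would come from the adapter described in component (c) rather than raw features, and the quantized outputs would be mapped back into a shared space before decoding; this sketch only shows the routing-and-quantization core.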