MM-ACT: Learn from Multimodal Parallel Generation to Act
Abstract
A generalist robotic policy requires both semantic understanding for task planning and predictive capabilities for interacting with the environment. To this end, we present MM-ACT, a unified Vision-Language-Action (VLA) model that integrates text, image, and action in a shared token space and performs generation across all three modalities. MM-ACT adopts a re-mask parallel decoding strategy for text and image generation, and a one-step parallel decoding strategy for action generation to improve efficiency. We further introduce Context-Shared Multimodal Learning, a unified training paradigm that supervises generation in all three modalities from a shared context, enhancing action generation through cross-modal task learning. Experiments were conducted on the LIBERO simulation benchmark and a Franka real-robot setup to assess in-domain performance, and on RoboTwin 2.0 to assess out-of-domain performance. Our approach achieves a success rate of 96.3\% on LIBERO, 62.2\% across four Franka real-robot tasks, and 52.38\% across eight bimanual RoboTwin 2.0 tasks, with an additional gain of 9.25\% from text-image co-training.
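The abstract does not specify the decoding procedures in detail. As a minimal illustrative sketch only, assuming a confidence-based re-masking scheme in the spirit of masked parallel decoders, the toy code below contrasts the two strategies named above: iterative re-mask parallel decoding (fill every masked position, keep the most confident predictions, re-mask and refine the rest) versus one-step parallel decoding for actions (a single forward pass emits all action tokens). All names here (toy_logits, remask_parallel_decode, one_step_action_decode, MASK_ID) are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of the two decoding regimes; not the paper's implementation.
import torch

VOCAB, SEQ = 32, 16          # toy vocabulary and sequence length
MASK_ID = VOCAB              # out-of-vocabulary id used as the mask token

def toy_logits(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for the unified transformer backbone: random per-position logits."""
    return torch.randn(tokens.shape[0], VOCAB)

@torch.no_grad()
def remask_parallel_decode(steps: int = 4) -> torch.Tensor:
    """Re-mask parallel decoding (text/image): predict all masked positions at
    once, then re-mask the least-confident positions and refine them."""
    tokens = torch.full((SEQ,), MASK_ID, dtype=torch.long)
    for step in range(steps):
        probs = toy_logits(tokens).softmax(-1)      # (SEQ, VOCAB)
        conf, pred = probs.max(-1)                  # per-position confidence
        masked = tokens == MASK_ID
        tokens[masked] = pred[masked]               # fill every mask in parallel
        if step < steps - 1:
            # shrink the re-masked set each pass until all tokens are kept
            k = int(SEQ * (1 - (step + 1) / steps))
            if k > 0:
                worst = conf.topk(k, largest=False).indices
                tokens[worst] = MASK_ID
    return tokens

@torch.no_grad()
def one_step_action_decode(action_dim: int = 7) -> torch.Tensor:
    """One-step parallel decoding (action): all action tokens in a single pass."""
    tokens = torch.full((action_dim,), MASK_ID, dtype=torch.long)
    return toy_logits(tokens).argmax(-1)

print(remask_parallel_decode())
print(one_step_action_decode())
```

Under these assumptions, the efficiency claim follows directly: action generation costs one forward pass regardless of action dimension, while text and image generation amortize quality over a small fixed number of refinement passes rather than one pass per token.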