Learning a Unified Latent Action Space from Videos with Action-centric Cycle Consistency
Abstract
Video data provides a rich source of supervision beyond expensive action-labeled data for advancing robot learning. Recent approaches have shown promising results by learning latent actions from video for policy training. A latent action tokenizer encodes latent actions between successive video frames and is trained to reconstruct future frames from current frames and the encoded latent actions. However, because each pair of successive frames is unique, the tokenizer can reconstruct future frames with little understanding of transition dynamics, which hinders the learning of semantically consistent latent actions. Moreover, the tokenizer typically allocates distinct subsets of latent actions to individual embodiments to accommodate heterogeneous morphologies, constraining knowledge transfer across them. To overcome these limitations, we propose action-centric cycle consistency, which aims to establish a unified latent action space. Our method samples latent actions from the latent action space, decodes them together with video frames to generate diverse subsequent frames, and then enforces cycle consistency by predicting the sampled actions from the original and generated frames. This simple objective creates a challenging task: the tokenizer must infer the latent action that links a current frame to each of many diverse generated futures, compelling it to develop semantically consistent action representations. In addition, sampled latent actions can be applied to video frames from distinct embodiments, facilitating the alignment of latent actions across embodiments. Experiments demonstrate that our approach achieves a 20.1% improvement over OpenVLA on the LIBERO benchmark and increases the average length from 3.27 to 3.93 on the CALVIN benchmark. In real-world experiments, our method maintains strong performance with a 44% improvement.
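To make the objective concrete, below is a minimal PyTorch sketch of the action-centric cycle-consistency loss as described above. The module architectures, the Gaussian prior over latent actions, and the MSE cycle loss are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of action-centric cycle consistency.
# Assumptions (not from the paper): MLP encoder/decoder over frame features,
# a Gaussian prior over latent actions, and an MSE cycle loss.
import torch
import torch.nn as nn


class LatentActionEncoder(nn.Module):
    """Predicts the latent action linking a frame pair (inverse model)."""

    def __init__(self, frame_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * frame_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, frame_t, frame_t1):
        return self.net(torch.cat([frame_t, frame_t1], dim=-1))


class ActionDecoder(nn.Module):
    """Generates the next frame from the current frame and a latent action."""

    def __init__(self, frame_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, frame_dim),
        )

    def forward(self, frame_t, action):
        return self.net(torch.cat([frame_t, action], dim=-1))


def cycle_consistency_loss(encoder, decoder, frame_t, action_dim):
    # 1) Sample latent actions from the shared action space
    #    (Gaussian prior assumed here).
    z = torch.randn(frame_t.size(0), action_dim)
    # 2) Decode the sampled actions with the current frames to generate
    #    diverse hypothetical future frames.
    frame_t1_gen = decoder(frame_t, z)
    # 3) Re-predict the sampled action from the (original, generated) pair
    #    and penalize the mismatch, closing the cycle.
    z_hat = encoder(frame_t, frame_t1_gen)
    return nn.functional.mse_loss(z_hat, z)


# Usage with toy feature vectors standing in for encoded video frames.
enc, dec = LatentActionEncoder(64, 8), ActionDecoder(64, 8)
frames = torch.randn(16, 64)
loss = cycle_consistency_loss(enc, dec, frames, action_dim=8)
loss.backward()
```

Because the sampled action `z` is fixed before any frames are generated, the same `z` can be decoded against frames from different embodiments, which is what allows the cycle loss to pull their latent actions into one shared space.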