CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning
Abstract
Unsupervised learning of latent motion from Internet videos is crucial for building generalist robots. Existing discrete methods mitigate the shortcut-learning problem, in which models encode excessive static background information, by applying vector quantization with a small codebook. However, they suffer from information loss and struggle to capture complex, fine-grained dynamics. Moreover, there is an inherent gap between the distribution of discrete latent motion and that of continuous robot actions, which hinders the joint learning of a unified policy. We propose CoMo, which learns more precise continuous latent motion from Internet-scale videos. CoMo employs an early temporal difference (Td) mechanism that increases the difficulty of shortcut learning and explicitly enhances motion cues. To ensure that the continuous latent motion captures meaningful foreground information, we further propose a temporal contrastive learning (Tcn) scheme: positive pairs are constructed from motion representations with a small future-frame temporal offset, while negative pairs are formed by directly reversing the temporal direction. Td and Tcn work synergistically to keep the latent motion focused on the foreground and to reinforce motion cues. Critically, CoMo exhibits strong zero-shot generalization, enabling it to generate effective pseudo action labels for unseen videos. The shared continuous distribution of robot actions and video latent motion also significantly benefits the joint learning of a unified policy. Extensive simulated and real-world experiments show that policies co-trained with CoMo pseudo action labels achieve superior performance with both diffusion and autoregressive architectures.
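To make the Tcn construction concrete, the sketch below shows one plausible InfoNCE-style instantiation of the scheme described above: the anchor is the latent motion of a frame pair, the positive is the motion of a pair with a small future temporal offset, and the negative is the motion of the same pair with its temporal direction reversed. All function names, the single-negative setup, and the InfoNCE form are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

def tcn_loss(motion_anchor, motion_future, motion_reversed, temperature=0.1):
    """Illustrative temporal contrastive (Tcn) loss.

    motion_anchor:   (B, D) latent motion of frame pairs (t, t+k)
    motion_future:   (B, D) latent motion with a small future offset -> positives
    motion_reversed: (B, D) latent motion of the reversed pairs (t+k, t) -> negatives
    NOTE: a hedged sketch; the actual CoMo objective may differ.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a = normalize(motion_anchor)
    p = normalize(motion_future)
    n = normalize(motion_reversed)

    pos = np.sum(a * p, axis=-1) / temperature  # similarity to positives
    neg = np.sum(a * n, axis=-1) / temperature  # similarity to reversed-time negatives

    # InfoNCE with a single negative per anchor: pull positives close,
    # push time-reversed representations away.
    logits = np.stack([pos, neg], axis=-1)
    log_prob = pos - np.log(np.sum(np.exp(logits), axis=-1))
    return -np.mean(log_prob)
```

Under this formulation, the loss is small when forward-time motion representations agree across nearby offsets and differ from their time-reversed counterparts, which is one way to discourage the encoder from latching onto static background content.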