CLaD: Planning with Grounded Foresight via Cross-Modal Latent Dynamics
Abstract
We propose CLaD (Cross-modal Latent Dynamics), a framework for learning temporally consistent cross-modal representations for robotic manipulation. Our approach models transition dynamics rather than static state correspondences: asymmetric cross-attention lets proprioceptive transitions query semantic ones, extracting shared dynamics structure that respects the causal ordering imposed by actions. We formalize grounded latent foresight as predictions anchored to observed trajectories through EMA-based targets and to observable space through auxiliary reconstruction, preventing collapse to abstract, ungrounded representations. A diffusion policy conditions on these foresight representations via feature modulation, decoupling dynamics learning from control optimization. Evaluated on LIBERO-LONG, our method achieves 94.9\% success with 0.66B parameters, demonstrating that explicit cross-modal transition modeling enables parameter-efficient planning that outperforms substantially larger vision-language-action (VLA) models.
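The asymmetric cross-attention described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: token counts, the latent width, and the random projection weights are all assumptions, and the single attention direction (proprioceptive queries over semantic keys/values) is the point being demonstrated.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32                 # shared latent width (assumed)
n_prop, n_sem = 4, 8   # transition tokens per modality (assumed)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(proprio, semantic, Wq, Wk, Wv):
    """One-directional attention: proprioceptive tokens query semantic tokens."""
    Q = proprio @ Wq                       # (n_prop, d)
    K = semantic @ Wk                      # (n_sem, d)
    V = semantic @ Wv                      # (n_sem, d)
    attn = softmax(Q @ K.T / np.sqrt(d))   # (n_prop, n_sem) attention weights
    return attn @ V                        # (n_prop, d) fused dynamics features

proprio = rng.standard_normal((n_prop, d))
semantic = rng.standard_normal((n_sem, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
fused = cross_attend(proprio, semantic, Wq, Wk, Wv)
print(fused.shape)  # (4, 32)
```

Note the asymmetry: semantic tokens never attend back to proprioceptive ones, so information flows in the single direction the abstract prescribes.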