CLaD: Planning with Grounded Foresight via Cross-Modal Latent Dynamics
Abstract
We propose CLaD (Cross-modal Latent Dynamics), a framework for learning temporally consistent cross-modal representations for robotic manipulation. Our approach models transition dynamics rather than static state correspondences: asymmetric cross-attention lets proprioceptive transitions query semantic ones, extracting shared dynamics structure that respects the causal ordering imposed by actions. We formalize grounded latent foresight as predictions anchored to observed trajectories through EMA-based targets and to observable space through auxiliary reconstruction, preventing collapse to abstract, ungrounded representations. A diffusion policy conditions on these foresight representations via feature modulation, decoupling dynamics learning from control optimization. Evaluated on LIBERO-LONG, our method achieves 94.9\% success with 0.66B parameters, demonstrating that explicit cross-modal transition modeling enables parameter-efficient planning that outperforms substantially larger vision-language-action (VLA) models.
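The asymmetric cross-attention described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: token counts, the latent width, and the random projection weights are all assumptions, and the single attention direction (proprioceptive queries over semantic keys/values) is the point being demonstrated.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32                 # shared latent width (assumed)
n_prop, n_sem = 4, 8   # transition tokens per modality (assumed)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(proprio, semantic, Wq, Wk, Wv):
    """One-directional attention: proprioceptive tokens query semantic tokens."""
    Q = proprio @ Wq                       # (n_prop, d)
    K = semantic @ Wk                      # (n_sem, d)
    V = semantic @ Wv                      # (n_sem, d)
    attn = softmax(Q @ K.T / np.sqrt(d))   # (n_prop, n_sem) attention weights
    return attn @ V                        # (n_prop, d) fused dynamics features

proprio = rng.standard_normal((n_prop, d))
semantic = rng.standard_normal((n_sem, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
fused = cross_attend(proprio, semantic, Wq, Wk, Wv)
print(fused.shape)  # (4, 32)
```

Note the asymmetry: semantic tokens never attend back to proprioceptive ones, so information flows in the single direction the abstract prescribes.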