Inference-time Physics Alignment of Video Generative Models with Latent World Models
Abstract
State-of-the-art video generative models produce impressive visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding acquired during pre-training, we find that the shortfall in physical plausibility also stems from suboptimal inference strategies. We therefore treat improving the physical plausibility of video generation as an inference-time alignment problem and introduce WMReward. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward signal to search over and steer multiple candidate denoising trajectories, enabling test-time compute scaling for better generation quality. Empirically, our approach substantially improves physical plausibility across image-conditioned, multi-frame-conditioned, and text-conditioned generation settings, as validated by a human preference study. Notably, on the challenging PhysicsIQ benchmark we achieve a 62.00% final score, outperforming the previous state of the art by 6.78%. Our work demonstrates the viability of using latent world models to improve the physical plausibility of video generation, beyond this specific instantiation or parameterization.
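To make the search-and-select idea concrete, the sketch below shows a minimal best-of-N variant of inference-time alignment: sample several candidate videos, score each with a world-model-based reward, and keep the highest-scoring one. This is an illustrative assumption, not the paper's WMReward implementation; the abstract does not specify the actual reward or search procedure, and `ToyWorldModel`, `wm_reward`, and `best_of_n` are hypothetical stand-ins.

```python
import torch

class ToyWorldModel:
    """Hypothetical stand-in for a latent world model such as VJEPA-2.
    The real model's API is not given in the abstract."""
    def encode(self, frames):
        # Toy encoder: mean-pool pixels into one latent vector per frame.
        return frames.mean(dim=(-1, -2))

    def predict(self, latents):
        # Toy dynamics: predict each next latent as a copy of the current one.
        return latents

def wm_reward(wm, video):
    """Score a video by the world model's latent prediction error:
    lower error on next-frame latents => higher assumed physical plausibility."""
    with torch.no_grad():
        z_ctx = wm.encode(video[:, :-1])   # latents of frames 0..T-2
        z_next = wm.encode(video[:, 1:])   # latents of frames 1..T-1
        z_pred = wm.predict(z_ctx)         # world-model rollout in latent space
        return -torch.mean((z_pred - z_next) ** 2).item()

def best_of_n(sample_fn, wm, n=8):
    """Best-of-N search: draw n candidate denoising trajectories and keep
    the candidate the world-model reward ranks as most plausible."""
    candidates = [sample_fn() for _ in range(n)]
    scores = [wm_reward(wm, v) for v in candidates]
    return candidates[max(range(n), key=scores.__getitem__)]

if __name__ == "__main__":
    # Toy "generator": random videos of shape (batch, frames, channels, H, W).
    sample_fn = lambda: torch.rand(1, 16, 3, 32, 32)
    best = best_of_n(sample_fn, ToyWorldModel(), n=8)
    print("selected video shape:", tuple(best.shape))
```

Beyond pure selection over completed samples, the abstract also mentions steering candidate trajectories during denoising; a reward of this form could in principle be applied at intermediate denoising steps rather than only to final outputs, though the specifics are left to the paper body.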