NS-Diff: Fluid Navier–Stokes Guided Video Diffusion via Reinforcement Learning
Abstract
While recent video generation models achieve impressive visual quality, generating physically plausible videos remains challenging, especially for fluid dynamics and rigid-body motion. To address this, we present NS-Diff, a physics-guided reinforcement learning framework for video diffusion. First, we design a noise-robust physical dynamics detector that distinguishes rigid from fluid regions by analyzing motion in noisy latent frames. Second, we introduce a Physics-Conditioned Latent Injection module, which encodes velocity fields, deformation gradients, and material masks, and injects them into the DiT denoiser via cross-attention. Third, we propose a reinforcement learning optimization module that enforces simplified Navier–Stokes constraints on fluid regions and minimum-jerk principles on rigid bodies through policy gradients. Experiments on PhysVideoBench, UCF, and MSR-VTT show that our approach reduces jerk error by 43\%, decreases fluid divergence by 33\%, and improves FVD by 22.7\%, achieving higher physical plausibility and visual quality.