A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
Abstract
Anticipating diverse future states is a central challenge in video world modeling. A key limitation lies in the computational cost of generating multiple plausible futures with existing world models. Recent work demonstrates that predicting the future in the latent space of a vision foundation model (VFM), rather than in raw pixel space, greatly improves efficiency. Despite this progress, efficient VFM-based world models are still predominantly discriminative, producing predictions that implicitly average over many possible futures. To explicitly and efficiently model diverse plausible futures, we introduce DeltaWorld, the first VFM-based world model that moves beyond deterministic prediction and generates multiple plausible futures in a single forward pass. At the core of DeltaWorld is DeltaTok, a tokenizer that encodes feature differences between consecutive frames into a single compact “delta” token, effectively reducing redundancy among temporally adjacent feature maps. By representing futures as delta tokens, DeltaWorld efficiently generates multiple diverse predictions in parallel. Experiments on dense forecasting tasks demonstrate that DeltaWorld predicts futures that more closely align with real-world outcomes, while being orders of magnitude more efficient than existing generative world models. Code will be made publicly available.
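To make the delta-token idea concrete, the following is a minimal conceptual sketch, not the paper's actual DeltaTok architecture. It assumes each frame is represented by a grid of VFM patch features, takes the difference between consecutive feature maps, and compresses that difference into one compact token via a (here randomly initialized, hypothetical) pooling-and-projection step; shapes, names, and the pooling choice are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: N_PATCHES patch features of dimension D per frame,
# compressed to a single TOKEN_DIM-dimensional delta token.
N_PATCHES, D, TOKEN_DIM = 196, 768, 64

# Stand-in for a learned encoder; in practice this would be trained.
W_proj = rng.standard_normal((D, TOKEN_DIM)) / np.sqrt(D)

def delta_token(feat_prev: np.ndarray, feat_next: np.ndarray) -> np.ndarray:
    """Encode the feature difference between two consecutive frames
    into one compact 'delta' token (conceptual sketch only)."""
    delta = feat_next - feat_prev      # (N_PATCHES, D): change between adjacent frames
    pooled = delta.mean(axis=0)        # (D,): pool away spatial redundancy
    return pooled @ W_proj             # (TOKEN_DIM,): one token per frame transition

# Temporally adjacent frames yield highly redundant features,
# so the difference is small and compresses well.
f_t = rng.standard_normal((N_PATCHES, D))
f_t1 = f_t + 0.1 * rng.standard_normal((N_PATCHES, D))

tok = delta_token(f_t, f_t1)
print(tok.shape)  # (64,)
```

One frame transition thus costs a single token instead of a full feature map, which is what lets a generative model roll out many candidate futures in parallel at low cost.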