Poster
From Slow Bidirectional to Fast Causal Video Generator
Tianwei Yin · Qiang Zhang · Richard Zhang · William Freeman · Fredo Durand · Eli Shechtman · Xun Huang
Current video diffusion models achieve impressive generation quality but struggle in interactive applications because of their bidirectional attention dependencies: generating a single frame requires the model to process the entire sequence, including future frames. We address these limitations by introducing an autoregressive diffusion transformer adapted from a pretrained bidirectional video diffusion model. Our key innovations are twofold. First, we extend distribution matching distillation (DMD) to videos, compressing a 50-step denoising process into just 4 steps. Second, we develop an asymmetric distillation approach in which a causal student model learns from a bidirectional teacher with privileged future information. This strategy effectively mitigates error accumulation in autoregressive generation, enabling high-quality long-form video synthesis despite training on short clips. Our model achieves a total score of 82.85 on VBench-Long, outperforming all published approaches and, most importantly, uniquely enabling fast streaming inference on a single GPU at 9.4 FPS. Our method also supports streaming video editing, image-to-video generation, and dynamic prompting in a zero-shot manner. We will release code based on an open-source model in the future.
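The sketch below is not the authors' code; it is a minimal, hedged illustration of the two ideas the abstract describes: a block-causal attention mask so the student only sees past frames while the bidirectional teacher sees the full clip, and a DMD-style distribution-matching gradient that supervises the few-step student. All module names (TinyDiT, block_causal_mask, dmd_style_step), dimensions, and the single-step generation shown here are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of asymmetric (causal student / bidirectional teacher)
# DMD-style distillation. Not the released code; names and sizes are made up.
import torch
import torch.nn as nn


def block_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean (L, L) mask: tokens in frame i may attend only to frames <= i."""
    frame_ids = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    return frame_ids[None, :] <= frame_ids[:, None]


class TinyDiT(nn.Module):
    """Stand-in transformer denoiser; the real model would be a video DiT."""

    def __init__(self, dim: int = 64, heads: int = 4, causal: bool = False):
        super().__init__()
        self.causal = causal
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, num_frames, tokens_per_frame):
        mask = None
        if self.causal:
            # Student only: block attention to future frames.
            allowed = block_causal_mask(num_frames, tokens_per_frame).to(x.device)
            mask = ~allowed  # True = position is blocked
        h, _ = self.attn(x, x, x, attn_mask=mask)
        return x + self.mlp(x + h)


def dmd_style_step(student, teacher, fake_score, latents, num_frames, tokens_per_frame):
    """One distillation step: the causal student generates in few steps, then a
    distribution-matching gradient is formed from the bidirectional teacher
    ("real" score) and an auxiliary fake-score model tracking the student."""
    # Few-step generation from noise (4 steps in the paper; 1 shown for brevity).
    noise = torch.randn_like(latents)
    generated = student(noise, num_frames, tokens_per_frame)

    # Re-noise the student output; the teacher sees all frames bidirectionally.
    alpha = 0.7  # illustrative noise level
    noisy = alpha * generated + (1 - alpha**2) ** 0.5 * torch.randn_like(generated)
    with torch.no_grad():
        real_pred = teacher(noisy, num_frames, tokens_per_frame)
        fake_pred = fake_score(noisy, num_frames, tokens_per_frame)
    # DMD-style gradient pushes student samples toward the teacher's distribution.
    grad = fake_pred - real_pred
    return torch.sum(generated * grad.detach()) / generated.numel()


if __name__ == "__main__":
    frames, tokens, dim = 8, 16, 64
    student = TinyDiT(dim, causal=True)      # causal: can stream frame by frame
    teacher = TinyDiT(dim, causal=False)     # bidirectional: privileged future info
    fake_score = TinyDiT(dim, causal=True)
    x = torch.randn(2, frames * tokens, dim)
    loss = dmd_style_step(student, teacher, fake_score, x, frames, tokens)
    loss.backward()
    print(float(loss))
```

Because the student's attention is block-causal, already-generated frames can be cached at inference time and new frames produced as a stream, which is what enables the single-GPU streaming setting described above.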