Poster
Taming Teacher Forcing for Masked Autoregressive Video Generation
Deyu Zhou · Quan Sun · Yuang Peng · Kun Yan · Runpei Dong · Duomin Wang · Zheng Ge · Nan Duan · Xiangyu Zhang
We introduce Masked Autoregressive Video Generation (MAGI), a hybrid framework that combines the strengths of masked and causal modeling paradigms to achieve efficient video generation. MAGI utilizes masked modeling for intra-frame generation and causal modeling to capture temporal dependencies across frames. A key innovation is our Standard Teacher Forcing paradigm, which conditions masked frames on complete observation frames, enabling a smooth transition from token-level to frame-level autoregressive modeling. To mitigate issues like exposure bias, we incorporate targeted training strategies, setting a new benchmark for autoregressive video generation. Extensive experiments demonstrate that MAGI can generate long, coherent video sequences of over 100 frames, even when trained on as few as 16 frames, establishing its potential for scalable, high-quality video generation.
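The hybrid attention pattern the abstract describes — bidirectional (masked-modeling) attention among tokens within a frame, combined with causal attention to all tokens of fully observed earlier frames — can be sketched as a block attention mask. This is a minimal illustrative construction, not the authors' implementation; the function name and the numpy formulation are assumptions for exposition:

```python
import numpy as np

def frame_causal_attention_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Illustrative sketch of a MAGI-style attention mask.

    allowed[i, j] is True when query token i may attend to key token j:
    - tokens in the same frame see each other bidirectionally (intra-frame
      masked modeling);
    - tokens see every token of strictly earlier frames (causal inter-frame
      conditioning on complete observation frames, as in teacher forcing).
    """
    frame_id = np.repeat(np.arange(num_frames), tokens_per_frame)
    # Broadcast comparison: key's frame must not lie in the query's future.
    allowed = frame_id[None, :] <= frame_id[:, None]
    return allowed

# 3 frames x 2 tokens: within-frame 2x2 blocks are fully visible,
# blocks above the frame diagonal (future frames) are masked out.
mask = frame_causal_attention_mask(num_frames=3, tokens_per_frame=2)
```

Under this pattern, conditioning masked tokens of the current frame on *complete* past frames (rather than partially masked ones) is what lets training shift smoothly from token-level to frame-level autoregression.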