

Poster

Taming Teacher Forcing for Masked Autoregressive Video Generation

Deyu Zhou · Quan Sun · Yuang Peng · Kun Yan · Runpei Dong · Duomin Wang · Zheng Ge · Nan Duan · Xiangyu Zhang


Abstract:

We introduce Masked Autoregressive Video Generation (MAGI), a hybrid framework that combines the strengths of masked and causal modeling paradigms to achieve efficient video generation. MAGI utilizes masked modeling for intra-frame generation and causal modeling to capture temporal dependencies across frames. A key innovation is our Standard Teacher Forcing paradigm, which conditions masked frames on complete observation frames, enabling a smooth transition from token-level to frame-level autoregressive modeling. To mitigate issues like exposure bias, we incorporate targeted training strategies, setting a new benchmark for autoregressive video generation. Extensive experiments demonstrate that MAGI can generate long, coherent video sequences of over 100 frames, even when trained on as few as 16 frames, establishing its potential for scalable, high-quality video generation.
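The hybrid scheme described above — bidirectional (masked) attention among tokens within a frame, combined with causal attention across frames so each frame conditions only on fully observed earlier frames — can be illustrated with a minimal attention-mask sketch. This is an assumption-laden illustration, not the authors' implementation; the function name and layout (tokens grouped contiguously by frame) are hypothetical.

```python
import numpy as np

def hybrid_attention_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Boolean attention mask (True = may attend) sketching the hybrid idea:
    tokens attend bidirectionally within their own frame (masked modeling),
    and causally to all tokens of earlier frames (frame-level autoregression).
    Tokens are assumed to be ordered contiguously by frame.
    """
    n = num_frames * tokens_per_frame
    frame_idx = np.arange(n) // tokens_per_frame  # frame index of each token
    # Query token i may attend to key token j iff frame(j) <= frame(i):
    # same frame -> bidirectional; earlier frame -> causal; later frame -> blocked.
    return frame_idx[None, :] <= frame_idx[:, None]
```

Under this mask, a frame's tokens see each other freely (enabling masked intra-frame generation) while never seeing future frames, which is the property that lets teacher forcing condition masked frames on complete past observations.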
