

Poster

Taming Teacher Forcing for Masked Autoregressive Video Generation

Deyu Zhou · Quan Sun · Yuang Peng · Kun Yan · Runpei Dong · Duomin Wang · Zheng Ge · Nan Duan · Xiangyu Zhang


Abstract:

We introduce Masked Autoregressive Video Generation (MAGI), a hybrid framework that combines the strengths of masked and causal modeling paradigms to achieve efficient video generation. MAGI utilizes masked modeling for intra-frame generation and causal modeling to capture temporal dependencies across frames. A key innovation is our Standard Teacher Forcing paradigm, which conditions masked frames on complete observation frames, enabling a smooth transition from token-level to frame-level autoregressive modeling. To mitigate issues like exposure bias, we incorporate targeted training strategies, setting a new benchmark for autoregressive video generation. Extensive experiments demonstrate that MAGI can generate long, coherent video sequences of over 100 frames, even when trained on as few as 16 frames, establishing its potential for scalable, high-quality video generation.
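The hybrid scheme described above — bidirectional (masked) attention among tokens within a frame, combined with causal attention across frames so each frame conditions only on fully observed earlier frames — can be illustrated with a minimal attention-mask sketch. This is an assumption-laden illustration, not the authors' implementation; the function name and layout (tokens grouped contiguously by frame) are hypothetical.

```python
import numpy as np

def hybrid_attention_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Boolean attention mask (True = may attend) sketching the hybrid idea:
    tokens attend bidirectionally within their own frame (masked modeling),
    and causally to all tokens of earlier frames (frame-level autoregression).
    Tokens are assumed to be ordered contiguously by frame.
    """
    n = num_frames * tokens_per_frame
    frame_idx = np.arange(n) // tokens_per_frame  # frame index of each token
    # Query token i may attend to key token j iff frame(j) <= frame(i):
    # same frame -> bidirectional; earlier frame -> causal; later frame -> blocked.
    return frame_idx[None, :] <= frame_idx[:, None]
```

Under this mask, a frame's tokens see each other freely (enabling masked intra-frame generation) while never seeing future frames, which is the property that lets teacher forcing condition masked frames on complete past observations.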
