

Tutorial

From Video Generation to World Model

Wed 11 Jun 6 a.m. PDT — 3 p.m. PDT

Abstract:

In the past few years, the research community has witnessed remarkable advances in generative models, especially in the realm of video generation. Generating compelling and temporally coherent videos is challenging yet in high demand. To overcome these challenges, early text-to-video (T2V) methods such as Make-A-Video, MagicVideo, and LaVie explored the potential of text-to-image (T2I) pretraining. With the success of Diffusion Transformers (DiT), SORA, the first T2V model capable of generating high-fidelity videos up to 40 seconds long, was proposed, and the availability of large-scale, high-quality video datasets proved indispensable. Later methods, including CogVideoX and MovieGen, have further explored the potential of 3D VAEs. However, even the largest current T2V models still fail to maintain physical plausibility in most of the videos they generate. On the other hand, recent work such as Genie, Genie-2, and GameNGen has presented promising results on action-conditioned video generation, showing the great potential of controllable video generation for building world models. In this tutorial, we first provide a comprehensive background on text-to-video generation by reviewing both earlier and the most recent advanced T2V methods. We then discuss the connections, future directions, and potential solutions on the path from current video generation models to the ultimate world model.
