Poster
StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text
Roberto Henschel · Levon Khachatryan · Hayk Poghosyan · Daniil Hayrapetyan · Vahram Tadevosyan · Zhangyang Wang · Shant Navasardyan · Humphrey Shi
Abstract:
Text-to-video diffusion models enable the generation of high-quality videos that follow text instructions, simplifying the process of producing diverse and individual content. Current methods excel at generating short videos (up to 16 s) but produce hard cuts when naively extended to long video synthesis. To overcome these limitations, we present StreamingT2V, an autoregressive method that generates long videos with seamless transitions. The key components are: (i) a short-term memory block called the conditional attention module (CAM), which conditions the current generation on features extracted from the preceding chunk via an attentional mechanism, leading to consistent chunk transitions; (ii) a long-term memory block called the appearance preservation module (APM), which extracts high-level scene and object features from the first video chunk to prevent the model from forgetting the initial scene; and (iii) a randomized blending approach that allows a video enhancer to be applied autoregressively to videos of indefinite length, ensuring consistency across chunks. Experiments show that StreamingT2V produces more motion, while competing methods suffer from video stagnation when applied naively in an autoregressive fashion. StreamingT2V is thus a high-quality, seamless text-to-long-video generator that surpasses competitors in both consistency and motion.
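
To make the overall structure concrete, below is a minimal Python sketch of the autoregressive scheme the abstract describes. Everything in it is hypothetical rather than the authors' implementation: the callables generate_chunk, cam_features, apm_features, and enhance are placeholders for the actual diffusion model, CAM, APM, and video enhancer, and the pixel-space splice at a random offset inside the chunk overlap is only a simplified stand-in for the randomized blending step.

    # Hypothetical sketch of the StreamingT2V pipeline structure (not the authors' code).
    from typing import Callable, List, Optional
    import numpy as np

    def generate_long_video(
        prompt: str,
        num_chunks: int,
        overlap: int,
        generate_chunk: Callable,   # (prompt, short_term, long_term) -> chunk of frames
        cam_features: Callable,     # preceding chunk -> short-term conditioning (CAM stand-in)
        apm_features: Callable,     # first chunk -> long-term appearance anchor (APM stand-in)
        enhance: Callable,          # low-res chunk -> enhanced chunk
        rng: Optional[np.random.Generator] = None,
    ) -> np.ndarray:
        rng = rng or np.random.default_rng()

        # Stage 1: autoregressive chunk generation with short-term (CAM) and
        # long-term (APM) conditioning, as described in the abstract.
        chunks: List[np.ndarray] = []
        long_term = None
        short_term = None
        for i in range(num_chunks):
            chunk = generate_chunk(prompt, short_term=short_term, long_term=long_term)
            if i == 0:
                long_term = apm_features(chunk)   # anchor the initial scene appearance
            short_term = cam_features(chunk)      # condition the next chunk on this one
            chunks.append(chunk)

        # Stage 2: autoregressive enhancement. We assume consecutive chunks share
        # `overlap` frames; a random split point inside the overlap decides where the
        # output switches from the previous enhanced chunk to the next one. This is
        # a crude pixel-space stand-in for the paper's randomized blending.
        video = enhance(chunks[0])
        for chunk in chunks[1:]:
            hi = enhance(chunk)
            split = int(rng.integers(0, overlap + 1))
            video = np.concatenate([video[: len(video) - overlap + split], hi[split:]], axis=0)
        return video

    if __name__ == "__main__":
        # Toy invocation with all-zero dummy frames, just to exercise the interface.
        dummy_chunk = lambda prompt, short_term=None, long_term=None: np.zeros((16, 8, 8, 3))
        out = generate_long_video(
            "a cat surfing", num_chunks=4, overlap=8,
            generate_chunk=dummy_chunk,
            cam_features=lambda c: c, apm_features=lambda c: c, enhance=lambda c: c,
        )
        print(out.shape)  # (40, 8, 8, 3): 16 + 3 * (16 - 8) frames

The point of the sketch is the data flow: short-term features are refreshed every chunk, the long-term anchor is fixed after the first chunk, and the enhancer is applied chunk by chunk with overlapping frames merged at a random position so no hard cut appears at chunk borders.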