Dual-Granularity Memory for Efficient Video Generation
Hongjun Wang ⋅ Lin Liu ⋅ Jianguo Li ⋅ Tao Lin
Abstract
Video generation using recurrent architectures offers compelling efficiency advantages over attention-based transformers, particularly for long-sequence generation. However, chunked processing in recurrent models creates temporal discontinuities that harm long-range consistency. We introduce two complementary memory mechanisms to address this challenge at different granularities: \textbf{(1) Context Memory} maintains persistent global context within attention chunks through learnable \textit{sink columns} and \textit{boundary buffers}, adding only 150K parameters (\textless 0.1\% overhead); \textbf{(2) Latent Context-as-Memory (LCaM)} extends memory across video segments by storing and retrieving historical latent embeddings, enabling cross-segment consistency without requiring camera annotations or frame reconstruction. Applied to Generalized Spatial-temporal Propagation Networks (GSTPN), our dual-memory approach achieves \textbf{1.54$\times$ faster} inference than attention-based transformers while excelling on visual quality metrics. Our approach is particularly effective for knowledge distillation scenarios where only pre-extracted latent embeddings are available. This work demonstrates compelling efficiency-quality trade-offs for practical long video generation.
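To make the Context Memory idea concrete, the sketch below shows one plausible (single-head, PyTorch) realization of chunked attention with learnable sink key/value columns shared by every chunk and a boundary buffer carrying the last few key/value tokens of the previous chunk. All names, shapes, and hyperparameters (e.g. ChunkedAttentionWithContextMemory, num_sinks, boundary_len) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (assumed design, not the paper's code): chunked attention where each
# chunk attends to (a) a small set of learnable "sink" key/value columns that persist
# globally and (b) a "boundary buffer" of the previous chunk's final keys/values.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChunkedAttentionWithContextMemory(nn.Module):
    def __init__(self, dim: int, num_sinks: int = 4, boundary_len: int = 8):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Learnable sink columns: persistent global K/V entries visible to every chunk.
        self.sink_k = nn.Parameter(torch.randn(num_sinks, dim) * 0.02)
        self.sink_v = nn.Parameter(torch.randn(num_sinks, dim) * 0.02)
        self.boundary_len = boundary_len

    def forward(self, x: torch.Tensor, chunk_size: int) -> torch.Tensor:
        # x: (batch, seq_len, dim); seq_len assumed divisible by chunk_size for brevity.
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        outputs, boundary_k, boundary_v = [], None, None
        for start in range(0, n, chunk_size):
            end = start + chunk_size
            qc, kc, vc = q[:, start:end], k[:, start:end], v[:, start:end]
            # Keys/values for this chunk = sinks + previous boundary buffer + local tokens.
            k_parts = [self.sink_k.unsqueeze(0).expand(b, -1, -1)]
            v_parts = [self.sink_v.unsqueeze(0).expand(b, -1, -1)]
            if boundary_k is not None:
                k_parts.append(boundary_k)
                v_parts.append(boundary_v)
            k_all = torch.cat(k_parts + [kc], dim=1)
            v_all = torch.cat(v_parts + [vc], dim=1)
            outputs.append(F.scaled_dot_product_attention(qc, k_all, v_all))
            # Boundary buffer: last few K/V tokens of this chunk, reused by the next chunk.
            boundary_k = kc[:, -self.boundary_len:]
            boundary_v = vc[:, -self.boundary_len:]
        return self.proj(torch.cat(outputs, dim=1))
```

Under these assumptions, the extra parameters are just the sink columns (2 * num_sinks * dim values plus the usual projections), which is consistent with the sub-0.1\% overhead the abstract reports; the boundary buffer adds no parameters, only a small amount of carried state between chunks.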