Content-Aware Dynamic Patchification for Efficient Video Diffusion
Sheng Li ⋅ Connelly Barnes ⋅ Mamshad Nayeem Rizve ⋅ Hongwu Peng ⋅ Zhengang Li ⋅ Ohi Dibua ⋅ Alireza Ganjdanesh ⋅ Xulong Tang ⋅ Yan Kang ⋅ Yifan Gong
Abstract
Diffusion Transformers (DiTs) achieve strong video generation performance but suffer from prohibitive computation cost due to dense spatiotemporal tokenization. Most existing works rely on uniform patchification, tokenizing non-overlapping spatiotemporal regions with a fixed patch size regardless of the underlying content. This content-agnostic tokenization results in substantial redundant computation, especially in visually simple or static areas. To address this inefficiency while preserving video generation quality, we propose DynaPatch, a fine-grained dynamic patchification framework that adaptively selects patch sizes for each spatiotemporal region based on content complexity. A lightweight router predicts patch sizes directly from the latents encoded by a 3D Variational Autoencoder (VAE) and is jointly optimized with the diffusion model through a diffusion loss, an attention-guided saliency alignment loss, and a token-budget regularizer. Learnable patchify/unpatchify layers integrate seamlessly with standard DiT backbones, allowing flexible tokenization without architectural changes. Experiments demonstrate that DynaPatch effectively reduces redundant computation while preserving fine details, achieving 1.3–1.8$\times$ acceleration with minimal quality degradation. On VBench, DynaPatch attains a Total Score of 83.42 at 30\% token reduction, significantly outperforming prior patchification and token-pruning approaches. These results indicate that content-aware patchification offers an effective direction for efficient and scalable video diffusion.
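The abstract does not specify implementation details, so the following is a minimal PyTorch sketch of how a lightweight patch-size router and a soft token-budget term might look. All module names, shapes, candidate patch sizes, and the Gumbel-softmax relaxation are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (PyTorch): a content-aware patch-size router on 3D-VAE latents.
# Shapes, patch-size choices, and the Gumbel-softmax relaxation are assumptions
# made for illustration; they are not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchSizeRouter(nn.Module):
    """Predicts a per-region choice over candidate patch sizes from latents."""

    def __init__(self, latent_channels: int = 16, region: int = 4,
                 patch_choices=(1, 2, 4)):
        super().__init__()
        self.patch_choices = patch_choices
        # Pool each (region x region) spatial block into one routing decision.
        self.pool = nn.AvgPool3d(kernel_size=(1, region, region))
        self.mlp = nn.Sequential(
            nn.Linear(latent_channels, 64),
            nn.GELU(),
            nn.Linear(64, len(patch_choices)),
        )

    def forward(self, z: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # z: (B, C, T, H, W) latents from the 3D VAE.
        pooled = self.pool(z)                   # (B, C, T, H/r, W/r)
        pooled = pooled.permute(0, 2, 3, 4, 1)  # (B, T, H/r, W/r, C)
        logits = self.mlp(pooled)               # (B, T, H/r, W/r, K)
        # A straight-through Gumbel-softmax keeps the discrete patch-size
        # choice differentiable for joint training with the diffusion loss.
        return F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)


def expected_token_count(routing: torch.Tensor, region: int = 4,
                         patch_choices=(1, 2, 4)) -> torch.Tensor:
    """Soft per-sample token count, usable as a token-budget regularizer."""
    # A region patchified at size p contributes (region / p) ** 2 tokens.
    tokens_per_choice = torch.tensor(
        [(region / p) ** 2 for p in patch_choices],
        device=routing.device, dtype=routing.dtype,
    )
    return (routing * tokens_per_choice).sum(dim=(-1, -2, -3, -4))


if __name__ == "__main__":
    router = PatchSizeRouter()
    z = torch.randn(2, 16, 8, 32, 32)   # toy latent video: (B, C, T, H, W)
    routing = router(z)                  # one-hot patch-size choice per region
    print(routing.shape, expected_token_count(routing))
```

In this sketch the budget term `expected_token_count` could be weighted against the diffusion loss to trade generation quality for the kind of token reduction the abstract reports; the saliency alignment loss is omitted since its form is not described here.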