RAPID: Reusing Attention Sparsity with Inter-step Adaptation for Efficient Video Diffusion
Shangran Lin ⋅ Lu Lu ⋅ Jian Chen ⋅ Qiang Liu
Abstract
The prohibitive cost of 3D attention hinders high-quality video generation with diffusion models. Existing sparse attention methods either lack content adaptivity (static) or incur excessive overhead from per-step recalculation (dynamic). Our work challenges the necessity of this trade-off, based on a twofold empirical discovery: (1) attention patterns in video diffusion exhibit strong temporal stability, and (2) the requisite computational density decays progressively over the denoising process. This insight motivates RAPID, a framework that performs a one-shot estimation of attention-block importance early in the generation process. The resulting scores and a high-fidelity sparse mask are then cached for efficient reuse, eliminating recalculation overhead. The cached scores also enable an optional multi-stage adaptive pruning scheme (Turbo mode) for maximum acceleration. On leading models such as Wan2.1-14B and HunyuanVideo, our high-fidelity configuration surpasses all baselines across key quality metrics (PSNR, SSIM, LPIPS) under a controlled compute budget. Concurrently, its Turbo mode achieves speedups of up to $1.79\times$ on Wan2.1-14B and $2.01\times$ on HunyuanVideo while maintaining strong visual quality.
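The following is a minimal, illustrative sketch of the estimate-once, reuse-later idea described above, not the authors' implementation. All names (`estimate_block_importance`, `build_block_mask`, `calib_step`, the keep-ratio schedule) and numeric settings are hypothetical placeholders chosen only to make the flow concrete.

```python
# Hypothetical sketch: score attention blocks once at an early denoising step,
# cache the scores and the resulting block-sparse mask, reuse the mask for the
# remaining steps, and optionally re-prune the cached scores ("Turbo" style).
import torch


def estimate_block_importance(q, k, block_size):
    """Score each (query-block, key-block) pair via pooled QK^T magnitude."""
    # q, k: [seq_len, head_dim]; pool tokens into blocks before the product.
    q_blocks = q.view(-1, block_size, q.shape[-1]).mean(dim=1)  # [Bq, d]
    k_blocks = k.view(-1, block_size, k.shape[-1]).mean(dim=1)  # [Bk, d]
    return q_blocks @ k_blocks.T                                # [Bq, Bk]


def build_block_mask(scores, keep_ratio):
    """Keep the top `keep_ratio` fraction of key blocks per query block."""
    num_keep = max(1, int(scores.shape[-1] * keep_ratio))
    idx = scores.topk(num_keep, dim=-1).indices
    mask = torch.zeros_like(scores)
    mask.scatter_(-1, idx, 1.0)
    return mask.bool()


def run_denoising(num_steps=50, seq_len=1024, head_dim=64, block_size=64,
                  calib_step=5, keep_ratio=0.5, turbo=False):
    cached_scores, cached_mask = None, None
    for step in range(num_steps):
        # Stand-ins for the Q/K a video DiT layer would produce at this step.
        q = torch.randn(seq_len, head_dim)
        k = torch.randn(seq_len, head_dim)

        if cached_scores is None:
            # Early steps run dense attention; at the calibration step we do
            # the one-shot importance estimation and cache scores + mask.
            if step == calib_step - 1:
                cached_scores = estimate_block_importance(q, k, block_size)
                cached_mask = build_block_mask(cached_scores, keep_ratio)
            continue  # dense attention would run here

        if turbo:
            # Hypothetical multi-stage schedule: prune more aggressively as
            # denoising progresses, reusing cached scores (no re-estimation).
            ratio = keep_ratio * (1.0 - 0.5 * step / num_steps)
            cached_mask = build_block_mask(cached_scores, ratio)

        # A block-sparse attention kernel would consume `cached_mask` here,
        # skipping the masked-out key blocks entirely.
        if step % 10 == 0:
            print(f"step {step:3d}: block density = "
                  f"{cached_mask.float().mean().item():.2f}")


if __name__ == "__main__":
    run_denoising(turbo=True)
```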