LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation
Yushi Huang ⋅ Xingtong Ge ⋅ Ruihao Gong ⋅ Chengtao Lv ⋅ Jun Zhang
Abstract
Video diffusion models (DMs) have enabled high-quality video synthesis, but their computational cost scales quadratically with sequence length due to the nature of self-attention. While linear attention offers a more efficient alternative, fully replacing quadratic attention demands costly pretraining, largely because linear attention lacks sufficient expressiveness and struggles with the complex spatiotemporal dynamics inherent to video generation. In this paper, we present LinVideo, an efficient data-free post-training framework that replaces a target number of self-attention modules with linear attention while preserving performance. First, we observe a significant disparity in the replaceability of different layers. Instead of relying on manual or heuristic choices, we frame layer selection as a binary classification problem and propose selective transfer, which automatically and progressively converts layers to linear attention with minimal performance impact. Additionally, to overcome the ineffectiveness and even inefficiency of existing objectives in optimizing this challenging transfer process, we introduce an anytime distribution matching (ADM) objective that aligns the distributions of samples across any timestep along the sampling trajectory. This objective is highly efficient and recovers model performance. Extensive experiments show that LinVideo achieves a $\mathbf{1.43\text{-}1.71\times}$ speedup while preserving generation quality, and the 4-step distilled models further reduce latency by $\mathbf{15.9\text{-}20.9\times}$ with only a minor drop in visual quality.
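The abstract does not specify the exact linear attention variant used, but the O(n) claim rests on a standard observation: with a positive kernel feature map, attention can be computed as $\phi(Q)\,(\phi(K)^\top V)$ instead of $(\phi(Q)\phi(K)^\top)\,V$, avoiding the $n \times n$ attention matrix. The PyTorch sketch below illustrates this reordering using the ELU+1 feature map of Katharopoulos et al. (2020); it is a generic illustration under that assumption, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized linear attention, O(n) in sequence length.

    q, k, v: tensors of shape (batch, heads, seq_len, dim).
    Feature map phi(x) = elu(x) + 1 keeps features positive so the
    normalizer is well defined (assumption; the paper's map may differ).
    """
    q = F.elu(q) + 1
    k = F.elu(k) + 1
    # Reordered computation: (q @ k^T) @ v costs O(n^2 * d), while
    # q @ (k^T @ v) costs O(n * d^2) -- linear in sequence length n.
    kv = torch.einsum("bhnd,bhne->bhde", k, v)                    # (b, h, d, d)
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)  # normalizer
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

# Usage: replace a quadratic self-attention call on video tokens, e.g.
# out = linear_attention(q, k, v) for q, k, v of shape (2, 8, 4096, 64).
```

Because the $d \times d$ state `kv` is independent of $n$, memory and compute grow linearly with the number of video tokens, which is what makes swapping selected self-attention layers for linear attention attractive at long sequence lengths.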