Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models
Zhirong Shen ⋅ Rui Huang ⋅ Jiacheng Liu ⋅ Chang Zou ⋅ Peiliang Cai ⋅ Shikang Zheng ⋅ Zhengyi Shi ⋅ Liang Feng ⋅ Linfeng Zhang
Abstract
Diffusion Transformers (DiTs) have achieved state-of-the-art image and video generation performance, but sampling remains expensive due to repeated transformer forward passes over many timesteps. Feature caching offers a training-free way to accelerate inference by reusing or forecasting hidden representations, yet recent forecasting-based methods derive their coefficients from hand-crafted formulas (e.g., Taylor expansion), which ultimately reduce to fixed linear combinations of a few historical features. Such fixed coefficients are suboptimal and fragile under aggressive skipping. In this paper, we first show that existing forecasting-based caching methods can be unified in a common linear form, and then analyze DiT feature trajectories, finding that for most denoising steps the current feature can be reconstructed from past features with projection fidelity above 0.95, indicating that accurate linear prediction is feasible. Motivated by this, we propose $L^2P$ (Learnable Linear Predictor), a simple data-driven caching framework that replaces hand-designed coefficients with learnable per-timestep weights trained on a small set of cached trajectories using a mean-squared error loss, converging in about 20 seconds on a single GPU. Extensive experiments on state-of-the-art DiTs demonstrate that $L^2P$ consistently outperforms existing caching baselines: on FLUX.1-dev, $L^2P$ achieves a 4.55$\times$ FLOPs reduction and 4.15$\times$ latency speedup with a PSNR of 31.459, and on Qwen-Image and Qwen-Image-Lightning, it maintains high visual fidelity even under up to 7.18$\times$ acceleration, where prior methods suffer from noticeable quality degradation. These results show that learning linear predictors is a practical and effective alternative to designing increasingly complex forecasting formulas for efficient diffusion model inference.
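The core idea of the abstract can be sketched in a few lines: for each timestep, learn coefficients so that the current feature is a linear combination of the $k$ most recent cached features, with the coefficients chosen to minimize the MSE over a small set of cached trajectories. The sketch below is illustrative only; the function names, the choice of $k$, and the use of the closed-form least-squares solution (the exact minimizer of the MSE loss) are our assumptions, not the paper's stated implementation.

```python
import numpy as np

def fit_linear_predictor(trajectories, k=2):
    """Fit per-timestep linear coefficients (hypothetical sketch).

    trajectories: array of shape (n_traj, T, d) holding cached feature
    trajectories. Returns weights of shape (T - k, k), where row t - k
    holds the coefficients predicting x_t from x_{t-k}, ..., x_{t-1},
    obtained via least squares (the closed-form MSE minimizer).
    """
    n, T, d = trajectories.shape
    weights = []
    for t in range(k, T):
        # Stack the k past features as regressors: shape (n * d, k).
        X = np.stack([trajectories[:, t - k + j, :] for j in range(k)],
                     axis=-1).reshape(-1, k)
        y = trajectories[:, t, :].reshape(-1)
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        weights.append(w)
    return np.stack(weights)

def predict(past_feats, w):
    """Forecast the current feature from k past features (k, d) -> (d,)."""
    return np.tensordot(w, past_feats, axes=(0, 0))
```

A fixed Taylor-style cache corresponds to hard-coding `w` (e.g., `[-1, 2]` for linear extrapolation from the last two features); the data-driven version instead fits `w` per timestep from the cached trajectories.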