ResCa: Residual Caching for Diffusion Transformers Acceleration
Abstract
Diffusion transformers have achieved remarkable progress in high-quality image and video generation, but their computational overhead remains a significant challenge. Existing token reduction-based acceleration techniques, such as caching and merging, attempt to reduce this cost from both temporal and spatial perspectives, but often compromise generation quality by introducing denoising directions that are either stale (non-updated) or borrowed from other tokens (non-self). In this paper, we propose Residual Caching (ResCa), a novel, training-free framework that introduces a proxy denoising perspective to overcome these limitations. ResCa achieves acceleration while keeping every token's denoising trajectory both its own and freshly updated. The core idea is to perform true denoising on only one proxy token within each trajectory-based cluster, and to use the proxy's computed multi-order residuals to guide the simulated denoising of all other tokens in that cluster. ResCa can be seamlessly integrated into various diffusion models, including DiT, FLUX, and HunyuanVideo. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method, achieving up to a 5.5× reduction in GFLOPs while maintaining near-lossless generation quality on FLUX.
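The proxy-denoising idea described above can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the function name `resca_step`, the second-order correction coefficient `0.5`, and the convention that each cluster's first index is its proxy are all assumptions made for clarity; how clusters are formed from token trajectories and how higher-order residuals are weighted are defined by the method itself, not shown here.

```python
import numpy as np

def resca_step(tokens, clusters, denoise_fn, residual_cache):
    """One hypothetical ResCa-style step (illustrative sketch).

    tokens: (N, D) array of token features at the current timestep
    clusters: list of index arrays; by assumed convention, clusters[k][0]
              is the proxy token for cluster k
    denoise_fn: the expensive per-token denoising call (e.g. the transformer)
    residual_cache: dict mapping cluster id -> list of past proxy residuals,
                    used here for a toy second-order term
    """
    out = tokens.copy()
    for k, idx in enumerate(clusters):
        proxy = idx[0]
        # True denoising is performed only once per cluster, on the proxy.
        new_proxy = denoise_fn(tokens[proxy])
        residual = new_proxy - tokens[proxy]  # first-order residual
        hist = residual_cache.setdefault(k, [])
        if hist:
            # Toy "multi-order" term: extrapolate from the residual's
            # change across steps (coefficient 0.5 is an assumption).
            residual = residual + 0.5 * (residual - hist[-1])
        hist.append(new_proxy - tokens[proxy])
        # The proxy's residual guides the simulated denoising of the
        # remaining tokens in the cluster.
        out[idx] = tokens[idx] + residual
        out[proxy] = new_proxy  # the proxy keeps its true result
    return out
```

In this sketch, the transformer call count per step drops from the number of tokens to the number of clusters, which is where the GFLOPs savings would come from.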