DeltaQuant: 4-bit Video Diffusion Models with Spatiotemporal Delta Smoothing
Abstract
Video diffusion models have achieved remarkable generative performance, but their substantial computational and memory costs pose significant challenges for deployment, especially on consumer GPUs. As recent advances in attention optimization mitigate previous computational bottlenecks, linear layers now dominate both computational cost and inference memory. In this work, we focus on quantizing both weights and activations to 4 bits to accelerate these layers. Previous methods, such as SVDQuant, overlook the highly dynamic nature of activations across denoising timesteps, where outlier channels and magnitudes vary dramatically. However, video data inherently exhibits strong activation similarity among neighboring tokens in space and time, which we term \textbf{spatiotemporal activation similarity}, analogous to how video codecs exploit intra- and inter-frame redundancy. Leveraging this property, we introduce \textbf{DeltaQuant}, which partitions activations into local 3D spatiotemporal cubes and uses each cube's mean token as a \textbf{core token}, quantizing only the small differences (\textbf{delta tokens}) to 4 bits while keeping core tokens in FP8. This decomposition substantially reduces quantization error with minimal overhead. For weight quantization, DeltaQuant incorporates SVDQuant's low-rank decomposition to further reduce quantization error. We also implement an efficient kernel that translates DeltaQuant's computational benefits into real-world speedups. Extensive experiments on Wan 2.2 I2V, Wan 2.2 T2V, and LTX-Video T2V demonstrate that DeltaQuant maintains high generation fidelity. On Wan 2.2, it compresses model size by 2.9× and reduces memory footprint by 2.3×. DeltaQuant is compatible with efficient attention mechanisms and few-step distillation; when integrated with these techniques, it achieves an additional 3.0× acceleration, for a total 111.8× end-to-end speedup. Code and models will be released upon publication.
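To make the core/delta decomposition concrete, the following is a minimal NumPy sketch of the idea described above: tokens within a spatiotemporal cube are decomposed into the cube's mean (kept in high precision, standing in for the FP8 core token) plus small per-token deltas that are quantized to 4 bits. The cube size, the symmetric per-token quantizer, and the toy data are illustrative assumptions, not the paper's actual kernel or calibration scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def quant_dequant_int4(x):
    # Symmetric per-token 4-bit quantization (levels in [-8, 7]),
    # immediately dequantized to measure the rounding error.
    scale = np.max(np.abs(x), axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(x / scale), -8, 7) * scale

# Toy activations: 16 cubes of 8 tokens, 64 channels each.
# Neighboring tokens in a cube share a base signal plus small noise,
# mimicking spatiotemporal activation similarity.
base = rng.normal(size=(16, 1, 64)) * 5.0
acts = base + rng.normal(size=(16, 8, 64)) * 0.1

# Baseline: quantize the raw tokens directly to 4 bits.
direct = quant_dequant_int4(acts)

# DeltaQuant-style: core token = per-cube mean (high precision);
# only the small deltas go through the 4-bit quantizer.
core = acts.mean(axis=1, keepdims=True)
recon = core + quant_dequant_int4(acts - core)

err_direct = np.abs(acts - direct).mean()
err_delta = np.abs(acts - recon).mean()
print(err_direct, err_delta)
```

Because the deltas are far smaller in magnitude than the raw tokens, their quantization scale (and hence the rounding error) shrinks proportionally, which is the source of the reduced quantization error claimed above.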