FlashDecoder: Real-Time Latent-to-Pixel Streaming Decoder with Transformers
Abstract
Recent progress in video generation has shifted large-scale models from convolutional architectures to Diffusion Transformers (DiTs), yet latent-to-pixel video decoders remain predominantly convolutional. These decoders rely on heavy 3D convolutions, which slow down streaming generation and require spatial–temporal tiling to handle high-resolution or long-duration outputs. We introduce FlashDecoder, the first Transformer-based latent-to-pixel video decoder designed for streaming. FlashDecoder processes video latents frame by frame during both training and inference, applying bidirectional spatial attention within each frame while maintaining causal temporal dependencies through a rolling KV cache. Crucially, causality is enforced by sequential frame processing rather than explicit attention masks, enabling the use of memory-efficient bidirectional attention kernels throughout. This unified streaming approach ensures constant per-frame computation and bounded memory via a fixed-size KV cache that automatically evicts older frames, enabling stable training at resolutions up to 720p. Integrated into the Wan2.2 video VAE, FlashDecoder matches the reconstruction quality of the convolutional decoder (PSNR 38.38 vs. 38.29; LPIPS 0.046 vs. 0.039) while decoding up to 4x faster (139 FPS at 480p and 69.6 FPS at 720p), achieving real-time high-resolution video decoding on a single H100 GPU.
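To make the streaming mechanism concrete, the following is a minimal sketch of one attention layer's per-frame step under the design the abstract describes: the current frame's tokens attend bidirectionally to themselves and to a bounded window of cached past frames, with no causal mask, since future frames simply have not been processed yet. All names here (RollingKVCache, decode_frame, max_frames) and the token counts are our illustrative assumptions, not the FlashDecoder implementation.

```python
import torch
import torch.nn.functional as F


class RollingKVCache:
    """Fixed-size per-layer cache of past frames' keys/values.

    Holds at most `max_frames` frames; appending beyond that evicts the
    oldest frame, which bounds memory and keeps per-frame cost constant.
    (Hypothetical helper, not from the paper's code.)
    """

    def __init__(self, max_frames: int):
        self.max_frames = max_frames
        self.k, self.v = [], []  # one (B, H, N, D) tensor per cached frame

    def append(self, k: torch.Tensor, v: torch.Tensor):
        self.k.append(k)
        self.v.append(v)
        if len(self.k) > self.max_frames:  # automatic eviction of the oldest frame
            self.k.pop(0)
            self.v.pop(0)

    def keys_values(self):
        # Concatenate cached frames along the token axis.
        return torch.cat(self.k, dim=-2), torch.cat(self.v, dim=-2)


def decode_frame(q, k, v, cache: RollingKVCache):
    # The current frame attends to itself (bidirectional spatial attention)
    # and to all cached past frames. No attention mask is passed: temporal
    # causality falls out of processing frames sequentially.
    cache.append(k, v)
    k_all, v_all = cache.keys_values()
    return F.scaled_dot_product_attention(q, k_all, v_all)


# Usage: stream frames one at a time with a bounded cache.
B, H, N, D = 1, 8, 45 * 80, 64  # batch, heads, tokens per frame (illustrative), head dim
cache = RollingKVCache(max_frames=4)
for _ in range(16):
    q = torch.randn(B, H, N, D)
    k = torch.randn(B, H, N, D)
    v = torch.randn(B, H, N, D)
    out = decode_frame(q, k, v, cache)  # constant per-frame compute and memory
```

Because no mask is ever constructed, the attention call can use any memory-efficient bidirectional kernel as-is, which is the point the abstract makes about avoiding explicit causal masking.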