Generative Neural Video Compression via Video Diffusion Prior
Abstract
We present \textbf{GNVC-VD}, the first DiT-based generative neural video compression framework built upon an advanced video generation foundation model, in which spatio-temporal latent compression and sequence-level generative refinement are unified within a single codec. Existing perceptual codecs primarily rely on pre-trained \textbf{image} generative priors to restore high-frequency details, but their frame-wise nature lacks temporal modeling and inevitably leads to \textbf{perceptual flickering}. To address this, GNVC-VD introduces a unified flow-matching latent refinement module that leverages a \textbf{video diffusion transformer} to jointly enhance intra- and inter-frame latents through sequence-level denoising, ensuring consistent spatio-temporal details. Instead of denoising from pure Gaussian noise as in video generation, GNVC-VD initializes refinement from the decoded spatio-temporal latents and learns a correction term that adapts the diffusion prior to compression-induced degradation. A conditioning adaptor further injects compression-aware cues into intermediate DiT layers, enabling effective artifact removal while maintaining temporal coherence under extreme bitrate constraints. Extensive experiments show that GNVC-VD surpasses both traditional and learned codecs in perceptual quality and significantly reduces the flickering artifacts that persist in prior generative approaches, even below 0.01~bpp, highlighting the promise of integrating video-native generative priors into neural codecs for next-generation perceptual video compression.
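A minimal sketch of how the flow-matching correction described above could be formulated, assuming a rectified-flow-style interpolation between the decoded latent $\hat{z}$ and the ground-truth latent $z$; the specific trajectory, time weighting, and conditioning interface are illustrative assumptions rather than the paper's stated objective:
\begin{align}
z_t &= (1-t)\,\hat{z} + t\,z, \qquad t \in [0,1],\\
\mathcal{L}(\theta) &= \mathbb{E}_{t,\,(z,\hat{z})}\big\| v_\theta(z_t, t, c) - (z - \hat{z}) \big\|_2^2,
\end{align}
where $v_\theta$ is the velocity predicted by the video DiT and $c$ denotes the compression-aware cues injected by the conditioning adaptor. Under this sketch, inference integrates $v_\theta$ from $t=0$ (the decoded latent) toward $t=1$, so refinement starts from the degraded latent rather than pure Gaussian noise.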