SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model

Zhengang Li · Yan Kang · Yuchen Liu · Difan Liu · Tobias Hinz · Feng Liu · Yanzhi Wang

Arch 4A-E Poster #374
Wed 19 Jun 5 p.m. PDT — 6:30 p.m. PDT

Abstract: While AI-generated content has garnered significant attention, photo-realistic video synthesis remains a formidable challenge. Despite promising advances in diffusion models for video generation quality, their complex architectures and the substantial computational demands of both training and inference leave a significant gap between these models and real-world applications. In this paper, we introduce SNED, a superposition network architecture search for efficient video diffusion models. Our framework employs a supernet training paradigm that covers a range of model-cost and resolution options through weight sharing. We further introduce a systematic fast-training optimization strategy, including supernet training sampling warm-up and transfer learning from image-based diffusion models. To showcase the flexibility of our method, we conduct experiments on both pixel-space and latent-space video diffusion models. The results demonstrate that our framework consistently produces comparable results across different model options with high efficiency. For the pixel-space video diffusion model, we achieve consistent video generation simultaneously across resolutions from 64$\times$64 to 256$\times$256 and model sizes ranging from 640M to 1.6B parameters.
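The weight-sharing idea behind supernet training can be illustrated with a minimal sketch (this is a hypothetical toy, not the authors' code): one over-parameterized layer holds the weights of the largest model option, and smaller sub-networks simply reuse a leading slice of the same tensor, so every sampled option updates shared parameters.

```python
import numpy as np

# Toy illustration of weight-sharing supernet sampling (hypothetical,
# not the SNED implementation). A "supernet" linear layer stores the
# weights of the largest option; smaller sub-networks reuse the leading
# slice of the same tensor, so all options share parameters.

rng = np.random.default_rng(0)

MAX_IN, MAX_OUT = 8, 8
supernet_weight = rng.standard_normal((MAX_OUT, MAX_IN))

def subnet_forward(x, out_dim):
    """Run a sub-network using only the first `out_dim` output channels."""
    w = supernet_weight[:out_dim, : x.shape[0]]  # shared view into supernet weights
    return w @ x

x = rng.standard_normal(MAX_IN)
small = subnet_forward(x, 4)  # smaller model option
large = subnet_forward(x, 8)  # largest model option

# Because both options slice the same shared tensor, the small option's
# outputs equal the leading entries of the large option's outputs:
assert np.allclose(large[:4], small)
```

In an actual supernet search, a sub-network (channel widths, depths, or resolution settings) would be sampled at each training step, and gradients from every sampled option flow into the same shared weight tensor.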
