Tea-Adapter: Teacher Adapter for Efficient Conditional Generation
Abstract
We propose Tea-Adapter, a plug-and-play adapter that efficiently transfers conditional knowledge from a smaller teacher model into a larger student video diffusion model. Existing controllable video DiT methods face critical challenges: fully fine-tuning billion-parameter models is prohibitively expensive, while cascaded ControlNets add substantial parameter overhead and offer limited flexibility for novel multi-condition compositions. To overcome these issues, Tea-Adapter introduces a novel reversed distillation method that enables large video diffusion models to inherit precise control capabilities from smaller, efficiently tuned teacher diffusion models, eliminating the need for full fine-tuning. Moreover, recognizing the intrinsic relationships between different conditions, we replace the cascaded ControlNet design with a Mixture of Condition Experts (MCE) layer, which dynamically routes diverse conditional inputs within a unified architecture and supports both single-condition control and multi-condition combinations without additional training cost. To achieve cross-scale knowledge transfer, we further develop a Feature Propagation Module that ensures efficient and temporally consistent feature propagation across video frames. Experiments demonstrate that Tea-Adapter enables high-fidelity multi-condition video synthesis, making advanced controllable video generation feasible on low-resource hardware and establishing a new efficiency standard for the field.
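To illustrate the routing idea behind the MCE layer, the following is a minimal sketch of a mixture-of-experts layer whose gating is restricted to the condition types present in the input, so single-condition and multi-condition requests share one set of weights. All class names, shapes, and the linear experts here are illustrative assumptions for exposition, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MixtureOfConditionExperts:
    """Toy MCE layer: one small expert per condition type (e.g. depth,
    pose, edge), plus a router that weights expert outputs per token.
    Names and shapes are illustrative, not the paper's."""

    def __init__(self, dim, num_experts, seed=0):
        rng = np.random.default_rng(seed)
        # One linear expert per condition type (stand-in for real experts).
        self.experts = [rng.standard_normal((dim, dim)) * 0.02
                        for _ in range(num_experts)]
        # Router maps each token to per-expert logits.
        self.router = rng.standard_normal((dim, num_experts)) * 0.02

    def __call__(self, tokens, active):
        """tokens: (n, dim); active: indices of the conditions present.
        Routing is masked to active experts, so adding or dropping a
        condition needs no retraining."""
        logits = tokens @ self.router            # (n, num_experts)
        mask = np.full(logits.shape, -np.inf)
        mask[:, active] = 0.0
        gates = softmax(logits + mask, axis=-1)  # zero weight on inactive
        out = np.zeros_like(tokens)
        for i in active:
            out += gates[:, i:i + 1] * (tokens @ self.experts[i])
        return out

mce = MixtureOfConditionExperts(dim=8, num_experts=3)
x = np.random.default_rng(1).standard_normal((4, 8))
y_single = mce(x, active=[0])     # single condition (e.g. depth only)
y_multi = mce(x, active=[0, 2])   # novel composition, same weights
```

With one active expert, the masked softmax assigns it all the gate mass, so the layer reduces to that expert alone; with several, the router mixes them per token, which is the flexibility cascaded ControlNets lack.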