PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models
Abstract
Propagation-based video editing enables precise user control by propagating a single edited frame to subsequent frames while preserving the original context, such as motion and structure. However, training such models requires large-scale paired (source and edited) video datasets, which are costly and complex to acquire. Hence, we propose PropFly, a training pipeline for Propagation-based video editing that relies on on-the-Fly supervision from pre-trained video diffusion models (VDMs) instead of off-the-shelf or precomputed paired video editing datasets. Specifically, PropFly leverages one-step clean latent estimations from intermediate noised latents under varying Classifier-Free Guidance (CFG) scales to synthesize diverse pairs of 'source' (low-CFG) and 'edited' (high-CFG) latents on the fly. The source latent provides the structural information of the video, while the edited latent provides the target transformation for learning propagation. Our pipeline enables an adapter attached to the pre-trained VDM to learn to propagate edits via a Guidance-Modulated Flow Matching (GMFM) loss, which guides the model to replicate the target transformation. This on-the-fly supervision ensures that the model learns temporally consistent and dynamic transformations. Extensive experiments demonstrate that PropFly significantly outperforms state-of-the-art methods on various video editing tasks, producing high-quality edits.
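To make the pair-synthesis step concrete, the following is a minimal sketch, not PropFly's actual implementation. It assumes a rectified-flow VDM where x_t = (1 - t) x_0 + t * eps and the model predicts the velocity v = eps - x_0, so the one-step clean estimate is x0_hat = x_t - t * v; the names `make_pair_on_the_fly`, `cfg_low`, and `cfg_high` are hypothetical.

```python
import torch

@torch.no_grad()
def make_pair_on_the_fly(model, x_t, t, cond, cfg_low=1.0, cfg_high=7.5):
    """Synthesize a ('source', 'edited') latent pair on the fly.

    Illustrative sketch: assumes a rectified-flow VDM with
    x_t = (1 - t) * x_0 + t * eps and a velocity prediction v = eps - x_0,
    giving the one-step clean estimate x0_hat = x_t - t * v.
    `model`, `cfg_low`, and `cfg_high` are assumed names, not PropFly's API.
    """
    v_cond = model(x_t, t, cond)      # conditional velocity prediction
    v_uncond = model(x_t, t, None)    # unconditional (null-prompt) prediction

    def clean_estimate(cfg):
        # Classifier-free guidance at the given scale, then one-step x0 estimate.
        v = v_uncond + cfg * (v_cond - v_uncond)
        return x_t - t * v

    source = clean_estimate(cfg_low)   # low CFG: retains source structure and motion
    edited = clean_estimate(cfg_high)  # high CFG: carries the target transformation
    return source, edited
```

Under this reading, each (source, edited) pair produced at training time supervises the adapter, e.g. via a flow-matching-style loss pulling the adapter's output toward the edited latent.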