Poster
Customized Condition Controllable Generation for Video Soundtrack
Fan Qi · KunSheng Ma · Changsheng Xu
Recent advancements in latent diffusion models (LDMs) have led to innovative approaches in music generation, allowing for increased flexibility and integration with other modalities. However, existing methods often rely on a two-step process that fails to capture the artistic essence of videos, particularly for complex videos requiring detailed sound effects and diverse instrumentation. In this paper, we propose a novel framework for generating video soundtracks that simultaneously produces music and sound effects tailored to the video content. Our method incorporates a Contrastive Visual-Sound-Music pretraining process that maps these modalities into a unified feature space, enhancing the model's ability to capture intricate audio dynamics. We design a Spectrum Divergence Masked Attention mechanism for the U-Net to differentiate between the distinct characteristics of sound effects and music, and we employ Score-guided Noise Iterative Optimization to give musicians customizable control during generation. Extensive evaluations on the FilmScoreDB and SymMV&HIMV datasets demonstrate that our approach significantly outperforms state-of-the-art baselines in both subjective and objective assessments, highlighting its potential as a robust tool for video soundtrack generation.
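The abstract describes a Contrastive Visual-Sound-Music pretraining stage that maps video, sound-effect, and music features into one shared space. The sketch below is a minimal, hypothetical interpretation of such a tri-modal alignment using a symmetric InfoNCE loss over all three modality pairs; the encoder dimensions, projection heads, and pairwise weighting are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: tri-modal contrastive (InfoNCE) alignment of video,
# sound-effect, and music embeddings in a single feature space.
# All dimensions and module choices below are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TriModalContrastive(nn.Module):
    def __init__(self, embed_dim: int = 512, temperature: float = 0.07):
        super().__init__()
        # Placeholder projection heads; in practice these would sit on top of
        # pretrained video / audio backbones.
        self.video_proj = nn.Linear(2048, embed_dim)
        self.sound_proj = nn.Linear(1024, embed_dim)
        self.music_proj = nn.Linear(1024, embed_dim)
        self.log_temp = nn.Parameter(torch.tensor(temperature).log())

    @staticmethod
    def info_nce(a: torch.Tensor, b: torch.Tensor, temp: torch.Tensor) -> torch.Tensor:
        # Symmetric InfoNCE between two batches of L2-normalized embeddings.
        a = F.normalize(a, dim=-1)
        b = F.normalize(b, dim=-1)
        logits = a @ b.t() / temp                      # (B, B) similarity matrix
        targets = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def forward(self, video_feat, sound_feat, music_feat):
        temp = self.log_temp.exp()
        v = self.video_proj(video_feat)
        s = self.sound_proj(sound_feat)
        m = self.music_proj(music_feat)
        # Align all three modality pairs so video, sound effects, and music
        # share one embedding space.
        return (self.info_nce(v, s, temp) +
                self.info_nce(v, m, temp) +
                self.info_nce(s, m, temp)) / 3.0


if __name__ == "__main__":
    # Usage on a batch of pre-extracted (dummy) features.
    B = 8
    model = TriModalContrastive()
    loss = model(torch.randn(B, 2048), torch.randn(B, 1024), torch.randn(B, 1024))
    loss.backward()
    print(loss.item())
```

This kind of shared embedding is what would let a downstream diffusion U-Net condition jointly on visual content and the two audio streams; the paper's Spectrum Divergence Masked Attention and Score-guided Noise Iterative Optimization are not reproduced here.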