

Diff-BGM: A Diffusion Model for Video Background Music Generation

Sizhe Li · Yiming Qin · Minghang Zheng · Xin Jin · Yang Liu

Arch 4A-E Poster #307
Fri 21 Jun 5 p.m. PDT — 6:30 p.m. PDT


When editing a video, an attractive piece of background music is indispensable. However, video background music generation faces several challenges: the lack of suitable training datasets, and the difficulty of flexibly controlling the generation process while sequentially aligning the video and music. In this work, we first propose BGM909, a high-quality music-video dataset with detailed semantic annotations and shot detection, providing multi-modal information about both video and music. We then present novel evaluation metrics that go beyond music quality: we evaluate diversity and the alignment between music and video by incorporating retrieval-precision metrics. Finally, we propose Diff-BGM, a framework that automatically generates background music for a given video, using different signals to control different aspects of the music during generation: dynamic video features control the rhythm, while semantic features control the melody and atmosphere. To align the video and music sequentially, we introduce a segment-aware cross-attention layer that enhances temporal consistency between video and music. Experiments verify the effectiveness of our proposed method.
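The abstract does not spell out how the segment-aware cross-attention layer works; one plausible reading is that each music token is restricted to attend only to video features from its temporally aligned shot segment. The sketch below illustrates that masking idea in NumPy. The function name, the per-token segment-id interface, and the single-head unprojected formulation are all assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def segment_aware_cross_attention(q, k, v, music_seg, video_seg):
    """Illustrative masked cross-attention (not the paper's code).

    Each music query token may only attend to video key/value tokens
    carrying the same shot-segment id, which is one way to enforce
    temporal alignment between music and video.

    q: (Tm, d) music queries; k, v: (Tv, d) video keys/values.
    music_seg: (Tm,), video_seg: (Tv,) integer segment ids per token
    (hypothetical interface; segment ids would come from shot detection).
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])           # (Tm, Tv) similarity
    mask = music_seg[:, None] == video_seg[None, :]   # True where segments match
    scores = np.where(mask, scores, -1e9)             # block cross-segment attention
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v                                # (Tm, d) attended video features

# Toy usage: two music tokens, four video tokens in two shot segments.
q = np.zeros((2, 4))                                  # uniform queries for clarity
k = np.random.randn(4, 4)
v = np.arange(16, dtype=float).reshape(4, 4)
out = segment_aware_cross_attention(
    q, k, v,
    music_seg=np.array([0, 1]),
    video_seg=np.array([0, 0, 1, 1]),
)
# With zero queries, each music token averages the video tokens of its own segment.
```

Restricting the mask by segment (rather than full cross-attention) is what makes the layer "segment-aware": music token 0 sees only video tokens 0-1, and music token 1 sees only video tokens 2-3.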
