Omni2Sound: A Fundamental Study on Dataset, Base Model, and Benchmark for Unified Video-Text-to-Audio Generation
Abstract
Training a unified model for video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant flexibility but is hindered by critical, largely unexplored challenges. We identify two foundational problems: (1) the scarcity of high-quality audio captions with tight audio-video-text (A-V-T) alignment, which leads to severe semantic conflicts in multimodal training data, and (2) cross-task and intra-task competition during joint multi-task training, manifesting as an adverse V2A-T2A performance trade-off and as modality bias in the VT2A task. First, to address the data scarcity, we introduce SoundAtlas, the first large-scale, human-expert-level audio caption dataset, which augments VGGSound and AudioSet with semantically rich and temporally detailed captions. Powered by a novel multi-turn agentic annotation pipeline built on advanced foundation models and operating cost-effectively, SoundAtlas achieves tight A-V-T alignment and a much lower hallucination rate than existing datasets. Second, we propose Omni2Sound, a diffusion-based unified VT2A model that supports flexible modality combinations. To resolve cross-task and intra-task competition, we design a three-stage progressive multi-task training schedule that converts cross-task competition into joint optimization and mitigates modality bias in the VT2A task, preserving both audio-visual alignment and faithful off-screen audio generation. Finally, we construct VGGSound-Omni, a comprehensive benchmark for the unified evaluation of VT2A, V2A, and T2A, including challenging off-screen tracks. With a vanilla DiT backbone, Omni2Sound achieves unified state-of-the-art performance on all three tasks within a single model and generalizes well across multiple benchmarks with different caption and video styles. Demonstrations are provided in the Appendix.