FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips
Abstract
Foley art plays a pivotal role in creating immersive auditory experiences in film, yet manually producing spatio-temporally aligned audio remains labor-intensive. We propose \textbf{FoleyDesigner}, a novel framework inspired by professional Foley workflows that integrates film clip analysis, spatio-temporally controllable Foley generation, and professional mixing capabilities. Technically, FoleyDesigner employs a multi-agent architecture for precise spatio-temporal analysis. It achieves spatio-temporal alignment through latent diffusion models trained on spatio-temporal cues extracted from video frames, combined with large language model (LLM)-driven hybrid mechanisms that emulate film-industry post-production practices. To address the lack of high-quality stereo Foley datasets for film, we introduce \textbf{FilmStereo}, the first professional stereo Foley dataset containing spatial metadata, precise timestamps, and semantic annotations for eight common Foley categories. In application, the framework supports interactive user control while integrating seamlessly with professional pipelines, including 5.1-channel Dolby Atmos systems compliant with the ITU-R BS.775 standard, thereby offering extensive creative flexibility. Extensive experiments demonstrate that our method achieves superior spatio-temporal alignment compared to existing baselines, with integration validated in industrial-grade film workflows.