Poster
SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction
Enrico Pallotta · Sina Mokhtarzadeh Azar · Shuai Li · Olga Zatsarynna · Jürgen Gall
Predicting future video frames is essential for decision-making systems, yet RGB frames alone often lack the information needed to fully capture the underlying complexities of the real world. To address this limitation, we propose a multi-modal framework for Synchronous Video Prediction (SyncVP) that incorporates complementary data modalities, enhancing the richness and accuracy of future predictions. SyncVP builds on pre-trained modality-specific diffusion models and introduces an efficient spatio-temporal cross-attention module to enable effective information sharing across modalities. We evaluate SyncVP against other video prediction methods on standard benchmark datasets, such as Cityscapes and BAIR, using depth as an additional modality, and demonstrate modality-agnostic generalization on SYNTHIA with semantic segmentation. Notably, SyncVP achieves state-of-the-art performance even in scenarios where depth conditioning is absent, demonstrating its robustness and potential for a wide range of applications.
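The abstract does not specify how the spatio-temporal cross-attention module is implemented, so the following is only a minimal PyTorch sketch of one plausible design: queries come from one modality branch (e.g. RGB) and keys/values from the other (e.g. depth), with attention factorized into a per-frame spatial pass and a per-location temporal pass for efficiency. The class name, tensor layout, and factorization are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class SpatioTemporalCrossAttention(nn.Module):
    """Hypothetical cross-attention block between two modality streams.

    Queries are drawn from the primary modality; keys/values from the
    complementary one. Attention is factorized into a spatial pass
    (tokens within each frame) and a temporal pass (the same spatial
    location across frames), which is cheaper than full space-time
    attention over T*H*W tokens.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # x, ctx: (B, T, N, C) token grids from the two modality
        # branches, where N = H * W flattened spatial positions.
        B, T, N, C = x.shape
        q, kv = self.norm_q(x), self.norm_kv(ctx)

        # Spatial cross-attention: attend within each frame independently.
        q_s = q.reshape(B * T, N, C)
        kv_s = kv.reshape(B * T, N, C)
        x = x + self.spatial_attn(q_s, kv_s, kv_s)[0].reshape(B, T, N, C)

        # Temporal cross-attention: attend across time at each location.
        q_t = x.permute(0, 2, 1, 3).reshape(B * N, T, C)
        kv_t = kv.permute(0, 2, 1, 3).reshape(B * N, T, C)
        out = self.temporal_attn(q_t, kv_t, kv_t)[0]
        return x + out.reshape(B, N, T, C).permute(0, 2, 1, 3)


# Example usage with made-up shapes (assumed, not from the paper):
rgb = torch.randn(2, 8, 16 * 16, 128)    # (B, T, H*W, C)
depth = torch.randn(2, 8, 16 * 16, 128)
xattn = SpatioTemporalCrossAttention(dim=128)
fused_rgb = xattn(rgb, depth)             # depth features inform the RGB branch
```

Such a block would sit inside each modality-specific diffusion backbone, letting the two denoising streams exchange information at every step; the residual connections mean the block degrades gracefully when the complementary modality is uninformative, which is consistent with the reported robustness when depth conditioning is absent.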