Semantic-Adaptive Diffusion for Dynamic Spatiotemporal Fusion
Abstract
Frequent and precise land surface monitoring is critical for numerous applications, yet no existing satellite achieves high temporal and high spatial resolution simultaneously. Spatiotemporal fusion (STF) addresses this trade-off by integrating multi-source satellite images to generate data with both fine spatial detail and dense temporal coverage. However, current methods often fail to recover dynamic landscape changes because of significant scale discrepancies among multi-source images. To address this limitation, we propose a semantic-adaptive diffusion framework for dynamic spatiotemporal fusion (SA-STF), which constrains the diffusion solution space using low-resolution and high-frequency features decoupled by a Taylor-inspired decoder. Through temporal feature alignment and semantic-adaptive fusion modules, SA-STF projects multimodal images with temporal dynamics into a unified latent space and adaptively enhances spatial details while preserving the spectral consistency of the reconstructed images. Experiments on benchmark datasets demonstrate that SA-STF outperforms existing methods in both quantitative and qualitative evaluations, particularly in complex and dynamic scenes.