Time Without Time: Pseudo-Temporal Representation for Space-Time Super-Resolution
Abstract
Space-time video super-resolution (STVSR) aims to upsample a video simultaneously in both the spatial and temporal dimensions. Previous studies on STVSR have primarily focused on task-specific architectures and modeling paradigms, while effective pretraining strategies remain underexplored. In this paper, we propose a pseudo-temporal space-time reconstruction pretraining framework for STVSR networks that enables effective use of image datasets, which naturally provide strong spatial cues. Each training sample is constructed by duplicating a single image into a pseudo-temporal video and independently zero-filling random pixel regions across its frames. Instead of designing a separate pretraining module, we pretrain the STVSR network on a task aligned with its core objectives of spatial restoration and cross-frame aggregation. The model learns to reconstruct clean, higher-spatio-temporal-resolution outputs from degraded, pseudo-temporal inputs, with a modulation factor encouraging greater focus on difficult regions. Extensive experiments show that our simple pretraining significantly improves STVSR performance and outperforms existing video representation learning approaches. We note that our method remains effective even when pretraining and fine-tuning with a limited quantity of data.
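The sample-construction step described above can be sketched in a few lines of numpy. This is a hypothetical illustration, not the paper's implementation: the frame count, patch size, and mask ratio below are assumed values chosen for clarity, and the actual method may mask at a different granularity.

```python
import numpy as np

def make_pseudo_temporal_sample(image, num_frames=7, mask_ratio=0.5,
                                patch=8, rng=None):
    """Duplicate one image into a pseudo-temporal clip and zero-fill
    random patch regions independently in each frame.

    Illustrative sketch: patch size, frame count, and mask ratio are
    assumptions, not settings reported by the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = image.shape
    # Repeat the single image along a new (pseudo-temporal) axis.
    clip = np.repeat(image[None], num_frames, axis=0).astype(np.float32)
    gh, gw = h // patch, w // patch
    for t in range(num_frames):
        # Independent random mask per frame: complementary visible
        # regions force the network to aggregate information across
        # frames when reconstructing the clean target.
        mask = rng.random((gh, gw)) < mask_ratio
        mask = np.kron(mask, np.ones((patch, patch), dtype=bool))
        clip[t][mask] = 0.0
    return clip  # degraded input; the original image is the target

# Usage: a toy 32x32 RGB image becomes a 7-frame degraded clip.
img = np.ones((32, 32, 3), dtype=np.float32)
clip = make_pseudo_temporal_sample(img, rng=np.random.default_rng(0))
print(clip.shape)  # (7, 32, 32, 3)
```

During pretraining, the STVSR network would take such a degraded clip as input and be trained to reconstruct the clean source image at every output frame, optionally weighted by a modulation factor on hard regions.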