Scaling4D: Pushing the Frontier of Video Novel View Synthesis through Large-Scale Monocular Videos
Abstract
Video Novel View Synthesis (VNVS) aims to render arbitrary novel viewpoints of dynamic scenes from a single-view video, but training such models faces a major challenge: the lack of large-scale multi-view video datasets. Prior methods often train on monocular data by framing the task as inpainting, which typically introduces a train-inference gap and visual artifacts. While synthetic multi-view data can partially alleviate this scarcity, its high acquisition cost and limited diversity restrict scalability. To address these problems, we propose Scaling4D, a novel strategy that in principle avoids the train-inference gap while leveraging large-scale monocular videos for training. Specifically, we take a higher-level perspective on the problem, reformulating VNVS as a general correspondence-guided generation task. Furthermore, to complement extensive real-world data, we establish a synthetic-data pipeline integrated with our training strategy to improve precision. Qualitative and quantitative results demonstrate a positive correlation between performance and training-data volume, confirming the scalability of our approach.
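To make the reformulation concrete, below is a minimal, hypothetical PyTorch sketch of the training idea the abstract describes, not the authors' implementation: a frame pair is sampled from a single monocular video, the source frame is warped to the target via estimated correspondences, and a generator is trained to reconstruct the target conditioned only on the warped source. Because inference conditions the generator on a source frame warped toward a novel camera in exactly the same format, the conditioning interface matches between training and inference. The names `Generator` and `estimate_correspondence`, the toy data, and the L1 objective are all assumptions for illustration.

```python
# Hypothetical sketch (not the authors' code): correspondence-guided
# generation trained on frame pairs from a single monocular video.
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp(src, flow):
    """Backward-warp `src` (B,C,H,W) with a dense flow field (B,2,H,W) in pixels."""
    b, _, h, w = src.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=src.device, dtype=src.dtype),
        torch.arange(w, device=src.device, dtype=src.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow  # pixel coordinates
    # Normalize to [-1, 1] as required by grid_sample.
    gx = 2.0 * grid[:, 0] / (w - 1) - 1.0
    gy = 2.0 * grid[:, 1] / (h - 1) - 1.0
    return F.grid_sample(src, torch.stack((gx, gy), dim=-1), align_corners=True)

class Generator(nn.Module):
    """Toy stand-in for the generative backbone: maps the warped
    (correspondence-aligned) source frame to the target frame."""
    def __init__(self, c=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, c, 3, padding=1),
        )
    def forward(self, x):
        return self.net(x)

def estimate_correspondence(src, tgt):
    """Placeholder: a real pipeline would obtain src->tgt correspondences
    from optical flow, or from depth plus relative camera pose."""
    return torch.zeros(src.shape[0], 2, *src.shape[2:], device=src.device)

gen = Generator()
opt = torch.optim.Adam(gen.parameters(), lr=1e-4)

# One training step on a frame pair sampled from a monocular video:
# frame t+k plays the role of the "novel view" of frame t.
src = torch.rand(2, 3, 64, 64)   # frame t
tgt = torch.rand(2, 3, 64, 64)   # frame t+k
flow = estimate_correspondence(src, tgt)
pred = gen(warp(src, flow))      # condition only on the warped source
loss = F.l1_loss(pred, tgt)
opt.zero_grad(); loss.backward(); opt.step()
print(f"reconstruction loss: {loss.item():.4f}")
```

At test time, under this reading, `flow` would come from the desired novel camera rather than a future frame, so the model sees the same kind of conditioning it was trained on, which is one way the train-inference gap the abstract mentions could be avoided.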