SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
Abstract
Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To this end, we introduce SpatialVID, a dataset comprising a large corpus of in-the-wild videos with diverse scenes, camera movements, and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than 21,000 hours of raw video and process it into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions. Analysis of SpatialVID's data statistics reveals a richness and diversity that directly foster improved model generalization and performance, establishing the dataset as a key asset for the video and 3D vision research community. Through extensive validation experiments, we demonstrate SpatialVID's effectiveness across tasks such as controllable video generation, world simulation, and geometric reconstruction, providing a strong foundation for spatial intelligence research.