Video2Robo: 3DGS-based Synthetic Data from One Video Enables Scalable Robot Learning
Abstract
Scalable robot learning is hindered by the high cost of acquiring diverse, high-quality embodied data. Existing data generation approaches partially mitigate this issue, but they typically depend on hard-to-access hardware and labor-intensive manual effort, and they generalize poorly to diverse scene configurations. To overcome these limitations, we propose Video2Robo, a framework that generates high-quality, diverse robot data directly from a single human demonstration video, enabling seamless deployment on physical robots. At its core, Video2Robo leverages 3D Gaussian Splatting (3DGS) as a powerful scene representation that supports high-fidelity rendering and explicit 3D scene editing. The framework tracks temporally consistent motion trajectories of task-relevant objects from the raw video and identifies key task skills, guiding robots to execute tasks in a kinematically plausible manner under novel object arrangements. Furthermore, by augmenting backgrounds, textures, lighting, and camera views, Video2Robo enhances the diversity of the generated data. Extensive evaluations in both simulated and real-world environments demonstrate that policies trained on Video2Robo data achieve superior generalization and transfer performance.