InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy
Yang Tian ⋅ Yuyin Yang ⋅ Yiman Xie ⋅ Zetao Cai ⋅ Xu Shi ⋅ Ning Gao ⋅ Hangxu Liu ⋅ Xuekun Jiang ⋅ Zherui Qiu ⋅ Feng Yuan ⋅ Yaping Li ⋅ Ping Wang ⋅ Junhao Cai ⋅ Jia Zeng ⋅ Hao Dong ⋅ Jiangmiao Pang
Abstract
Recent work explores how real and synthetic data contribute to the generalization of VLA models. While the $\pi$-series models have demonstrated the strong effectiveness of large-scale real-robot pre-training, synthetic data has not previously shown comparable capability at scale. This paper provides the first evidence that synthetic data alone can match the performance of the strongest $\pi$-dataset in pre-training a VLA model, revealing the substantial value of large-scale simulation. The resulting model also exhibits surprisingly strong zero-shot sim-to-real transfer on several challenging tasks. Our synthetic dataset, InternData-A1, contains over 630k trajectories and 7,433 hours of data across 4 embodiments, 18 skills, 70 tasks, and 227 scenes, covering rigid, articulated, deformable, and fluid-object manipulation. It is generated through a highly autonomous, fully decoupled, and compositional simulation pipeline that enables flexible task assembly, long-horizon skill composition, and heterogeneous embodiments with minimal manual tuning. Using the same architecture as $\pi_0$, we pre-train a model entirely on InternData-A1 and find that it matches the official $\pi_0$ across 49 simulation tasks, 5 real-world tasks, and 4 long-horizon dexterous tasks. We will open-source both the dataset and the generation pipeline to broaden access to large-scale robotic data and to lower the barrier to scalable data creation for embodied AI research.