Breaking the 3D Dataset Bottleneck: Fast Scalable Generation of Aligned 3D Assets from Scratch for Category 6D Pose Estimation and Robotic Grasping
Guillaume Duret ⋅ Danylo Mazurak ⋅ Florence Zara ⋅ Jan Peters ⋅ Liming Chen
Abstract
While 2D vision has been revolutionized by large-scale datasets like ImageNet, 3D vision remains constrained by the scarcity of high-quality, canonically aligned data. We introduce the first scalable, automated framework that generates complete category-level 6D pose datasets directly from text prompts, bypassing the need for existing 3D assets. Our method overcomes key challenges by: (1) ensuring reliable, scalable asset generation via a controlled text-to-image-to-3D pipeline; (2) enforcing built-in canonical alignment through depth-conditioned generation, achieving a 96\% pose consistency rate; and (3) enabling large-scale 6D annotation via mixed-reality rendering. The pipeline produces high-quality, aligned 3D meshes in under 3 minutes per object, a 5–20$\times$ speedup over traditional scanning. We generate over 1,000 instances for each of the 153 categories in the Omni6Dpose benchmark, culminating in 153,000 aligned meshes, a $>$40$\times$ increase in instances per category over previous aligned real-world datasets. Extensive evaluation demonstrates competitive zero-shot sim2real transfer on the NOCS 6D pose benchmark and superior robotic grasping performance in both simulation and real-world zero-shot transfer, where aligned meshes prove essential for success. We release the largest publicly available aligned 3D mesh dataset, the largest category-level 6D pose dataset, grasping simulation environments, and an open-source pipeline, providing a critical step toward foundation models for 3D understanding and enabling efficient, unlimited generation of task-specific 3D data from scratch.