3D-Object Perception Transformer (3PT)
Abstract
Current approaches to zero-shot 3D-object perception typically rely on ensembles of frozen foundation models.This limits deep object understanding and cross-domain generalization, making performance inadequate for real-world deployment. The 3D-Object Perception Transformer (3PT) addresses this limitation by unifying detection, segmentation, and 6DoF pose estimation in a single framework, directly trained for 3D-object perception. Based on two large-scale trained Transformers that specialize in 2D and 3D object-centric scene understanding respectively, 3PT continuously refines its object representations without depth input, enhancing 3D understanding by incorporating multi-view information. 3PT surpasses task-specialized models for detection and pose estimation, often achieving double-digit percentage improvements on the diverse BOP-benchmarks. Achieving high accuracy and robustness, \algshort{} is well-suited for practical industrial robotics applications such as bin picking and precise insertion.