Zoo3D: Zero-Shot 3D Object Detection at Scene Level
Andrey Lemeshko ⋅ Bulat Gabdullin ⋅ Nikita Drozdov ⋅ Anton Konushin ⋅ Danila Rukhovich ⋅ Maksim Kolodiazhnyi
Abstract
3D object detection is fundamental for spatial understanding. Real-world environments demand models capable of recognizing diverse, previously unseen objects, which remains a major limitation of closed-set methods. Existing open-vocabulary 3D detectors relax annotation requirements but still depend on training scenes, either as point clouds or images. We take this a step further by introducing $Zoo3D$, the first training-free 3D object detection framework. Our method constructs 3D bounding boxes via graph clustering of 2D instance masks, then assigns semantic labels using a novel open-vocabulary module with best-view selection and view-consensus mask generation. $Zoo3D$ operates in two modes: the zero-shot $Zoo3D_{0}$, which requires no training at all, and the self-supervised $Zoo3D_{1}$, which refines 3D box prediction by training a class-agnostic detector on $Zoo3D_{0}$-generated pseudo labels. Furthermore, we extend $Zoo3D$ beyond point clouds to work directly with posed and even unposed images. Across the ScanNet200 and ARKitScenes benchmarks, both $Zoo3D_{0}$ and $Zoo3D_{1}$ achieve state-of-the-art results in open-vocabulary 3D detection. Remarkably, our zero-shot $Zoo3D_{0}$ outperforms all existing self-supervised methods, demonstrating the power and adaptability of training-free, off-the-shelf approaches for real-world 3D understanding.
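The box-construction step described in the abstract can be illustrated with a minimal sketch: per-view 2D instance masks are lifted to 3D point sets, masks that overlap in 3D are connected in a graph, and each connected component yields one object box. All function names, the voxel-overlap score, the overlap threshold, and the axis-aligned box fitting below are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of graph clustering of lifted 2D masks into 3D objects.
# Assumed inputs: each mask is already back-projected to an (N, 3) array
# of 3D points; thresholds and voxel size are placeholder values.
import numpy as np

def overlap_3d(a, b, voxel=0.05):
    """Approximate overlap of two point sets via shared voxel cells (IoU)."""
    va = {tuple(p) for p in np.floor(a / voxel).astype(int)}
    vb = {tuple(p) for p in np.floor(b / voxel).astype(int)}
    union = len(va | vb)
    return len(va & vb) / union if union else 0.0

def cluster_masks(point_sets, iou_thr=0.25):
    """Connected components (union-find) over the mask-overlap graph."""
    n = len(point_sets)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if overlap_3d(point_sets[i], point_sets[j]) > iou_thr:
                parent[find(i)] = find(j)  # merge overlapping masks
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def fit_box(points):
    """Axis-aligned 3D box (min corner, max corner) from cluster points."""
    return points.min(axis=0), points.max(axis=0)
```

Each resulting cluster aggregates the points of its member masks, and `fit_box` gives a class-agnostic box; semantic labels would then be assigned by the open-vocabulary module.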