PromptDepth: Efficient and Promptable Geometric 3D Vision Model for Embodied Intelligence
Abstract
Vision models for embodied intelligence require efficient 3D comprehension of, and interaction with, objects in a scene. Existing 3D reconstruction models either overlook instance-level perception or rely on time-consuming offline reasoning, showing limited adaptability in real-time embodied scenarios. In this paper, we present PromptDepth, the first promptable vision model that combines geometric 3D understanding with instance-level interaction, designed specifically for embodied intelligence. PromptDepth is a feed-forward network that quickly yields panoptic, instanced, or tracked depth maps from two corresponding frames, enabling real-time inference on sequences from embodied agents. Specifically, formulating the task as a minimal prediction problem, we design a promptable Dense Prediction Transformer that flexibly produces unified dense predictions conditioned on a specific prompt. Considering the substantial discrepancy between panoptic and instanced depth maps, we further introduce a novel Instanced Label Distribution Smoothing (ILDS) loss, followed by Gram Anchoring, to mitigate the inherent conflict between dense and discrete representations. Trained solely on synthetic data, our model achieves state-of-the-art results in both depth estimation and interactive segmentation on public benchmarks. Extensive experiments demonstrate superior efficiency on embodied vision tasks compared to current foundation models. We believe that our efficient and flexible geometric 3D model offers a new foundation for vision tasks in embodied intelligence. The dataset and code will be released.