Scene-adaptive and Region-aware Multi-modal Prompt for Open Vocabulary Object Detection

Xiaowei Zhao · Xianglong Liu · Duorui Wang · Yajun Gao · Zhide Liu

Arch 4A-E Poster #208
Thu 20 Jun 5 p.m. PDT — 6:30 p.m. PDT

Abstract: Open Vocabulary Object Detection (OVD) aims to detect objects from novel classes described by text inputs by generalizing from the trained classes. Existing methods mainly focus on transferring knowledge from large vision-language models (VLMs) to detectors via knowledge distillation. However, these approaches adapt poorly to diverse classes and align weakly between image-level pre-training and region-level detection, hindering successful knowledge transfer. Motivated by prompt tuning, we propose scene-adaptive and region-aware multi-modal prompts to address these issues by effectively adapting class-aware knowledge from the VLM to the detector at the region level. Specifically, to enhance adaptability to diverse classes, we design a scene-adaptive prompt generator that considers both the commonality and the diversity of class distributions from a scene perspective, and formulate a novel selection mechanism to acquire knowledge common to all classes alongside insights specific to each scene. Meanwhile, to bridge the gap between the pre-trained model and the detector, we present a region-aware multi-modal alignment module, which employs a region prompt to incorporate positional information for feature distillation and integrates textual prompts to align visual and linguistic representations. Extensive experimental results demonstrate that the proposed method significantly outperforms state-of-the-art models on the OV-COCO and OV-LVIS benchmarks, surpassing the best current method by 3.0% mAP and 4.6% $\text{AP}_r$, respectively.
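The region-level vision-language alignment the abstract describes can be loosely sketched as scoring pooled region features against class text embeddings modulated by a learnable prompt. This is only an illustrative sketch, not the paper's implementation: `region_text_logits`, the additive `learnable_prompt`, and all shapes are assumptions (real prompt tuning typically prepends learnable tokens before the VLM text encoder rather than adding an offset afterward).

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Unit-normalize features, as CLIP-style models do before similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def region_text_logits(region_feats, text_embeds, learnable_prompt, temperature=0.01):
    """Score each region proposal against each class text embedding.

    region_feats:     (R, D) visual features pooled per region proposal
    text_embeds:      (C, D) frozen class-name embeddings from a VLM text encoder
    learnable_prompt: (D,)   a learnable offset standing in for prompt tuning
                             (hypothetical simplification for illustration)
    Returns (R, C) cosine-similarity logits scaled by a temperature.
    """
    prompted = l2_normalize(text_embeds + learnable_prompt)
    regions = l2_normalize(region_feats)
    return regions @ prompted.T / temperature
```

During training one would keep `text_embeds` frozen and update only the prompt parameters, so the VLM's pre-trained knowledge is adapted to region-level detection rather than overwritten.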
