SRA-Det: Learning Omni-Grained Open-Vocabulary Detection Beyond Category Names
Abstract
Open-vocabulary object detection (OVD) aims to detect objects described by arbitrary text, but most existing methods operate at a coarse category level and struggle with fine-grained, attribute-sensitive queries. We address this limitation from both the model and the data perspective. We propose a Semantic-Retrieval-Augmented Detector (SRA-Det) that uses an attention-based module to retrieve multiple semantic facets from token-level text features, together with a soft-min matching rule that acts as a differentiable logical AND over these facets, so that a region matches a query only when all key attributes are satisfied. In parallel, we introduce an automatic attribute-augmented data pipeline that uses an LLM to generate category-specific visual attributes and a dual CLIP-based similarity check to verify them at the instance level. With a Swin-T backbone, our approach achieves 54.9 mAP in the zero-shot setting on FG-OVD and 40.4 AP on LVIS, establishing strong performance on both fine-grained and general OVD.
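To make the matching rule concrete, the soft-min aggregation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each facet has already been scored against a region (e.g., by cosine similarity), and uses a temperature-weighted softmin so that the aggregate score is dominated by the worst-matching facet, approximating a logical AND while remaining differentiable. The function name, temperature value, and score ranges are illustrative assumptions.

```python
import numpy as np

def soft_min(facet_scores, tau=0.1):
    """Differentiable soft minimum over per-facet match scores.

    As tau -> 0 this approaches min(facet_scores): one low-scoring
    facet (one unsatisfied attribute) drags the whole score down,
    mimicking a logical AND over the facets.
    """
    s = np.asarray(facet_scores, dtype=float)
    # Softmax over negated scores concentrates weight on the minimum.
    w = np.exp(-s / tau)
    w /= w.sum()
    return float(np.dot(w, s))

# A region matching all facets keeps a high score,
# while one failed facet collapses the aggregate toward that low score.
all_match = soft_min([0.9, 0.8, 0.85])   # stays high
one_fails = soft_min([0.9, 0.9, 0.1])    # pulled near 0.1
```

A plain `min` would give the same hard-AND behavior at inference time, but the softmin variant passes gradients to every facet score, which is what makes it usable as a training objective.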