VLM4RSDet: Collaborative Optimization with Vision-Language Model for Enhancing Remote Sensing Object Detection
Shuohao Shi ⋅ Qiang Fang ⋅ Xin Xu
Abstract
Closed-set object detection in remote sensing imagery has made significant progress, but achieving high detection accuracy remains challenging. Vision-Language Models (VLMs), which possess rich prior knowledge, offer a promising solution to this challenge. However, most existing VLMs are designed for open-vocabulary tasks and exhibit inherent limitations when directly applied to closed-set scenarios, such as notable accuracy degradation and high deployment costs. To address these issues, we propose VLM4RSDet, a novel collaborative training framework that leverages a vision-language model to enhance the performance of conventional closed-set remote sensing object detectors. Notably, during inference, VLM4RSDet retains only the standard object detection architecture, thus avoiding any additional deployment overhead. Furthermore, we introduce a Global–Local Cross-Attention (GLCA) module and a Learnable Hierarchical Prediction Strategy (LHPS) to further improve collaborative training performance. Extensive experiments on five benchmark datasets demonstrate the effectiveness and robustness of our approach. In particular, our method outperforms the state-of-the-art by 7.5\% in mAP$_{0.5:0.95}$ on the VisDrone2019 dataset. Our code will be made publicly available.
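As context for the GLCA idea named above, the following is a minimal PyTorch sketch of what a global–local cross-attention block of this kind might look like, in which local detector features attend to global VLM-derived context. The class name, tensor shapes, and residual structure are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch of a Global-Local Cross-Attention (GLCA) block.
# Assumption: local detector tokens serve as queries, while global
# VLM context tokens serve as keys/values. Not the paper's exact module.
import torch
import torch.nn as nn


class GlobalLocalCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, local_feats: torch.Tensor, global_feats: torch.Tensor) -> torch.Tensor:
        # local_feats:  (B, N_local, dim)  detector feature tokens
        # global_feats: (B, N_global, dim) VLM-derived context tokens
        attended, _ = self.cross_attn(local_feats, global_feats, global_feats)
        # Residual connection preserves the original local features.
        return self.norm(local_feats + attended)


if __name__ == "__main__":
    glca = GlobalLocalCrossAttention()
    local = torch.randn(2, 100, 256)  # e.g., 100 proposal tokens
    glob = torch.randn(2, 16, 256)    # e.g., 16 pooled VLM tokens
    print(glca(local, glob).shape)    # torch.Size([2, 100, 256])
```

Because such a module is used only during collaborative training, dropping it at inference time is consistent with the claim that deployment keeps the plain detector architecture.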