Boosting Vision-Language Models Towards Cross-Domain Incremental Object Detection
Abstract
Incremental Object Detection (IOD) aims to equip detectors with the ability to handle dynamic environments and emerging object categories, and the rise of vision-language models has substantially advanced this goal. However, existing studies often oversimplify real-world scenarios by assuming that incremental tasks come from a single, general domain. To better investigate vision-language models under IOD, it is necessary to explore more generalized scenarios that encompass both novel categories and novel domains. To this end, we propose Cross-Domain Incremental Object Detection (CDIOD), a new benchmark that assesses the ability to continually adapt to diverse object detection tasks across domains. CDIOD reveals that existing methods struggle to balance adaptivity and stability under substantial domain shifts. To tackle this challenge, we propose Dynamic Group Subspace (DGS), a novel framework that dynamically groups tasks by distribution to promote knowledge sharing and prevent task collisions; progressively consolidates adapters to build shared subspaces and control parameter growth; and adopts a dynamic training pipeline to maintain a proper stability-adaptivity balance. DGS enables vision-language models to effectively handle task streams with varying degrees of distribution shift. Extensive experiments across three benchmarks demonstrate that DGS achieves state-of-the-art performance, highlighting its robustness in diverse incremental learning scenarios.