Boosting Quantitative and Spatial Awareness for Zero-Shot Object Counting
Da Zhang ⋅ Bingyu Li ⋅ Feiyu Wang ⋅ Zhiyuan Zhao ⋅ Junyu Gao
Abstract
Zero-shot object counting (ZSOC) aims to enumerate objects of arbitrary categories specified by text descriptions, without requiring visual exemplars. However, existing methods often treat counting as a coarse retrieval task and therefore lack fine-grained quantity awareness. They also frequently exhibit spatial insensitivity and degraded generalization, caused by feature space distortion during model adaptation. To address these challenges, we present **QICA**, a novel framework that combines quantity perception with robust spatial cost aggregation. Specifically, we introduce a Synergistic Prompting Strategy (**SPS**) that adapts the vision and language encoders through numerically conditioned prompts, bridging the gap between semantic recognition and quantitative reasoning. To mitigate feature distortion, we propose a Cost Aggregation Decoder (**CAD**) that operates directly on vision-text similarity maps. By refining these maps through spatial aggregation, CAD prevents overfitting while preserving zero-shot transferability. Additionally, a multi-level quantity alignment loss ($\mathcal{L}_{MQA}$) is employed to enforce numerical consistency across the entire pipeline. Extensive experiments on FSC-147 demonstrate competitive performance, while zero-shot evaluation on CARPK and ShanghaiTech-A validates superior generalization to unseen domains. Code is provided in the appendix.
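To make the notion of a vision-text similarity map concrete, the following minimal sketch computes a cosine similarity between each image patch feature and a single text feature, then smooths the resulting map over a local window. It is an illustration only, not the paper's implementation: the function names (`cosine`, `similarity_map`, `box_aggregate`), the nested-list feature layout, and the fixed mean filter are all assumptions, and the mean filter merely stands in for the learned spatial aggregation that CAD is described as performing.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def similarity_map(patch_feats, text_feat):
    """Build an H x W map of patch-to-text cosine similarities.

    patch_feats: H x W grid (nested lists) of D-dimensional patch features.
    text_feat:   one D-dimensional text feature (hypothetical encoder output).
    """
    return [[cosine(patch, text_feat) for patch in row] for row in patch_feats]


def box_aggregate(sim, radius=1):
    """Average each cell with its neighbours inside a (2*radius+1)^2 window.

    Assumption: a fixed mean filter used as a placeholder for learned
    spatial aggregation; windows are clipped at the map borders.
    """
    h, w = len(sim), len(sim[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            window = [
                sim[y][x]
                for y in range(max(0, i - radius), min(h, i + radius + 1))
                for x in range(max(0, j - radius), min(w, j + radius + 1))
            ]
            out[i][j] = sum(window) / len(window)
    return out


# Toy usage: a 2 x 2 grid of 2-dimensional patch features and one text feature.
patches = [[[1.0, 0.0], [0.0, 1.0]],
           [[1.0, 0.0], [1.0, 1.0]]]
text = [1.0, 0.0]
sim = similarity_map(patches, text)
refined = box_aggregate(sim, radius=1)
```

Because the aggregation step in this sketch touches only the similarity map, the patch and text features themselves are left unchanged, which mirrors the abstract's motivation for operating on similarity maps rather than on the encoder features.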