Poster
Exploring Contextual Attribute Density in Referring Expression Counting
Zhicheng Wang · Zhiyu Pan · Zhan Peng · Jian Cheng · Liwen Xiao · Wei Jiang · Zhiguo Cao
Referring expression counting (REC) algorithms aim to provide more flexible and interactive counting across varied fine-grained text expressions. However, the requirement for fine-grained attribute understanding poses challenges for prior methods, which struggle to accurately align attribute information with the correct visual patterns. Given the proven importance of "visual density", we presume that the limitations of current REC approaches stem from an under-exploration of "contextual attribute density" (CAD). In the scope of REC, we define CAD as a measure of the information intensity of a certain fine-grained attribute in visual regions. To model CAD, we propose a U-shaped CAD estimator in which the referring expression and multi-scale visual features from GroundingDINO interact with each other. With additional density supervision, we effectively encode CAD, which is subsequently decoded via a novel attention procedure with CAD-refined queries. Integrating all these contributions, our framework significantly outperforms state-of-the-art REC methods, achieving a 30% error reduction in counting metrics and a 10% improvement in localization accuracy. These results shed light on the significance of contextual attribute density for REC. Code will be available.
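The abstract describes a U-shaped estimator in which text features from the referring expression interact with multi-scale visual features and are supervised by a density target. The sketch below illustrates one plausible reading of that design; it is not the authors' released code, and all module names, feature dimensions, and the MSE density loss are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttnBlock(nn.Module):
    """Hypothetical fusion block: visual tokens attend to referring-expression tokens."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis_tokens, text_tokens):
        # vis_tokens: (B, N, C) flattened visual features; text_tokens: (B, T, C)
        fused, _ = self.attn(vis_tokens, text_tokens, text_tokens)
        return self.norm(vis_tokens + fused)


class UShapeCADEstimator(nn.Module):
    """Sketch of a U-shaped estimator: fuses multi-scale visual features with the
    referring expression and predicts a contextual attribute density (CAD) map."""
    def __init__(self, dims=(256, 256, 256)):
        super().__init__()
        self.fuse = nn.ModuleList([CrossAttnBlock(d) for d in dims])
        self.up = nn.ModuleList([
            nn.Conv2d(dims[i] + dims[i + 1], dims[i], 3, padding=1)
            for i in range(len(dims) - 1)
        ])
        self.head = nn.Conv2d(dims[0], 1, 1)  # per-pixel CAD estimate

    def forward(self, feats, text_tokens):
        # feats: list of multi-scale maps [(B, C, H_i, W_i)], finest first
        fused = []
        for f, blk in zip(feats, self.fuse):
            B, C, H, W = f.shape
            tokens = f.flatten(2).transpose(1, 2)      # (B, H*W, C)
            tokens = blk(tokens, text_tokens)          # condition on the expression
            fused.append(tokens.transpose(1, 2).reshape(B, C, H, W))
        # Decoder path: upsample coarse levels and merge them into finer ones
        x = fused[-1]
        for i in range(len(fused) - 2, -1, -1):
            x = F.interpolate(x, size=fused[i].shape[-2:],
                              mode="bilinear", align_corners=False)
            x = self.up[i](torch.cat([fused[i], x], dim=1))
        return self.head(x)                             # (B, 1, H_0, W_0) CAD map


def density_loss(pred, gt):
    # Assumed form of the additional density supervision (pixel-wise regression)
    return F.mse_loss(pred, gt)
```

The predicted CAD map could then refine the detection queries (e.g., by re-weighting query-to-feature attention) before decoding, which is how the abstract's "CAD-refined queries" are interpreted here.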