Adapting Lightweight Image-based Counting Models for Video Crowd Counting
Abstract
Video crowd counting aims to predict the number of people in each frame of a video. It requires effectively leveraging spatio-temporal (ST) information while satisfying real-time constraints. However, most existing methods extract and fuse ST information from neighboring frames through auxiliary modules, which incurs a large computational cost and requires buffering multiple frames during inference. Such designs limit their practicality in real-world applications with limited computational resources or stringent real-time requirements. To address these issues, we revisit video crowd counting from the perspective of lightweight image-based counting models that enable real-time deployment under limited resources. We analytically define ST information in a model-independent and statistically interpretable manner, and incorporate it into training via a statistical regularizer that effectively enhances model performance without adding modules or inference overhead. Most framework hyperparameters are further formulated as statistical inference problems, allowing automatic estimation from data and thus efficient adaptation to new scenarios. Our framework unifies video crowd counting and image-based counting models under a compact, principled formulation that is lightweight, portable, and efficient. We also establish theoretical foundations for adapting image-based counting models to video crowd counting, and achieve state-of-the-art accuracy and efficiency across six benchmarks, including the challenging DRONECROWD and VSCROWD.