Paper in Workshop: Pixel-level Video Understanding in the Wild Challenge

MTA-VPS: A Large-scale Benchmark for Video-Based Person Search

Ding Qi ⋅ Shuguang Dou ⋅ Jian Liu ⋅ Huaixuan Cao ⋅ Hao Zhang ⋅ Dongsheng Jiang ⋅ Cai Rong Zhao

Abstract

Existing person search methods identify and match individuals in single-frame images, but single frames offer limited information, often have poor image quality, and lack dynamic context, which reduces accuracy in real-world scenarios. To overcome these challenges, we introduce Video-based Person Search (VPS), a task that tracks individuals and their trajectories in raw video scenes, with the aim of improving performance in practical applications. Because suitable datasets are lacking and privacy is a concern, we created the first VPS dataset using virtual environments, featuring diverse continuous video scenes. Paired with an evaluation framework, this dataset assesses both retrieval performance and trajectory quality. We also evaluated various models to establish a comprehensive benchmark for VPS. Working at the video level introduces redundancy, which leads to missed or false detections and complicates matching. To address this, we propose the Coarse-to-Fine Search-by-Track (CFST) framework, which simplifies the process by conducting an initial search on low-resolution videos, selecting key anchors, and refining tracking to build complete trajectories. Our experiments show that CFST significantly improves retrieval accuracy and trajectory quality, outperforming existing methods on MTA-VPS and demonstrating its adaptability across complex video-level scenarios. The dataset will be made available.
