Scene-Centric Unsupervised Video Panoptic Segmentation
Abstract
Video panoptic segmentation (VPS) aims to jointly detect, segment, and track all objects while partitioning the video into semantically consistent regions. We introduce the task setting of unsupervised VPS, in which no human supervision is used. Existing work on unsupervised scene understanding has focused mainly on image segmentation; the video domain remains underexplored. We propose CUViPS, the first unsupervised VPS approach. CUViPS generates temporally consistent panoptic video pseudo-labels from monocular scene-centric videos by exploiting unsupervised depth, motion, and visual cues. Training on these pseudo-labels with a novel Video DropLoss yields an accurate unsupervised VPS model. To benchmark progress, we introduce a comprehensive evaluation protocol and four competitive baselines, extending state-of-the-art unsupervised panoptic image segmentation and unsupervised video instance segmentation models to VPS. CUViPS consistently outperforms all baselines and demonstrates strong label-efficient learning. With CUViPS, our evaluation protocol, and our baselines, we provide a strong foundation for future research on unsupervised VPS.