Poster

Minimizing Labeled, Maximizing Unlabeled: An Image-Driven Approach for Video Instance Segmentation

Fangyun Wei · Jinjing Zhao · Kun Yan · Chang Xu


Abstract:

Traditional video instance segmentation (VIS) models rely on extensive per-frame video annotations, which are time-consuming and costly to obtain. In this paper, we present MinMaxVIS, a novel VIS framework that reduces the dependency on fully labeled video datasets by using a small set of labeled images from the target domain together with a large volume of general-domain, unlabeled images. MinMaxVIS operates in three stages: first, a preliminary segmentation model is trained on the small labeled set from the target domain; second, this model retrieves relevant instances from the unlabeled dataset to build a high-quality pseudo-labeled set, ensuring rich content alignment with the target domain while avoiding the inefficiency of large-scale semi-supervised learning over the entire unlabeled set; finally, we train MinMaxVIS on a combination of labeled and pseudo-labeled data, addressing challenges such as noise in the pseudo-labels and instance association across frames. To simulate object continuity, we augment static images into paired frames, allowing MinMaxVIS to learn instance associations effectively. MinMaxVIS outperforms the prior image-driven approach, MinVIS, achieving higher mAP with substantially less labeled data. For instance, MinMaxVIS with a Swin-L backbone attains 62.2 mAP on YouTube-VIS 2019 using only 2% of the labeled data plus additional unlabeled images from SA-1B, surpassing MinVIS with the same backbone trained on fully labeled YouTube-VIS 2019 by 0.6 mAP.
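The abstract does not specify the retrieval criterion used to select relevant images from the unlabeled pool. The following is a minimal sketch of one plausible realization, assuming image-level retrieval by cosine similarity in the feature space of a generic encoder; the retrieve_topk name and the encoder argument are hypothetical illustrations, not the paper's actual interface.

import torch

@torch.no_grad()
def retrieve_topk(encoder, labeled_images, unlabeled_images, k=1000):
    """encoder: any image encoder mapping a batch (B, 3, H, W) -> (B, D) features.
    Returns indices of the k unlabeled images closest to the labeled target domain."""
    lab = torch.nn.functional.normalize(encoder(labeled_images), dim=-1)    # (L, D)
    unl = torch.nn.functional.normalize(encoder(unlabeled_images), dim=-1)  # (U, D)
    # Score each unlabeled image by its cosine similarity to the nearest labeled image.
    sim = (unl @ lab.T).max(dim=1).values                                   # (U,)
    return sim.topk(min(k, sim.numel())).indices

Restricting pseudo-labeling to a retrieved top-k subset is what lets the method sidestep semi-supervised training over the full unlabeled corpus: only the k most domain-aligned images are ever pseudo-labeled and revisited during training.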
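The paired-frame augmentation can be sketched similarly: applying one random affine warp to a static image and its instance masks yields a two-frame pseudo-clip in which cross-frame instance identities are known by construction. This is an illustrative sketch using torchvision, assuming affine warps as the augmentation family; make_paired_frame is a hypothetical helper, not the authors' released code.

import torch
import torchvision.transforms.functional as TF

def make_paired_frame(image: torch.Tensor, masks: torch.Tensor):
    """image: (3, H, W) float tensor; masks: (N, H, W) binary {0, 1} float masks.
    Returns a warped copy of both, simulating object motion between two frames."""
    angle = float(torch.empty(1).uniform_(-10, 10))
    translate = [int(torch.randint(-20, 21, (1,))), int(torch.randint(-20, 21, (1,)))]
    scale = float(torch.empty(1).uniform_(0.9, 1.1))
    shear = [0.0, 0.0]

    # Identical warp parameters for image and masks keep them aligned;
    # bilinear interpolation for the image, nearest for masks to keep them binary.
    frame2 = TF.affine(image, angle=angle, translate=translate, scale=scale,
                       shear=shear, interpolation=TF.InterpolationMode.BILINEAR)
    masks2 = TF.affine(masks, angle=angle, translate=translate, scale=scale,
                       shear=shear, interpolation=TF.InterpolationMode.NEAREST)
    return frame2, masks2

Frame 1 is the original image and frame 2 its warped copy, so instance i in masks corresponds to instance i in masks2, providing free ground-truth associations for supervising cross-frame instance matching.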
