Beyond Weak Supervision: MLLMs-Guided Graded Knowledge Distillation for Unsupervised Camouflaged Object Detection
Abstract
Most Camouflaged Object Detection (COD) methods rely on costly pixel-level annotations. Recent studies have adopted unsupervised COD (UCOD) to eliminate labeling costs, but they still suffer from two issues: 1) insufficient supervision, leading to reliance on the self-supervised DINO backbone and reduced model flexibility; and 2) ineffective use of pseudo-labels, which widens the performance gap with supervised methods and limits real-world applicability. In this paper, we propose a novel teacher-student framework for UCOD to address these two issues. To tackle the lack of supervision, we build a powerful teacher model that integrates Multimodal Large Language Models (MLLMs) and the Segment Anything Model (SAM) to generate high-quality pseudo-labels. However, the teacher model faces two challenges: 1) the suboptimal performance of MLLMs on COD, and 2) cascading errors. To address these challenges, we first propose a Camouflaged-Aware Chain-of-Thought (CA-CoT) for MLLMs. CA-CoT guides MLLMs through step-by-step reasoning that simulates human perceptual processes, thereby improving their performance on COD. Subsequently, we design a Graded Mask Evaluator (GME) to mitigate cascading errors: it evaluates and grades the quality of masks generated by SAM, then filters out low-quality masks to provide more reliable supervision. To better leverage pseudo-labels, we propose Graded Knowledge Distillation (GKD), which adaptively strengthens distillation at both the image and pixel levels according to pseudo-label quality. Extensive experiments show that our method outperforms existing UCOD approaches by a large margin and achieves performance comparable to weakly supervised methods. Notably, our method also performs well under zero-shot settings.
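To illustrate the quality-adaptive weighting idea behind GKD described above, the following is a minimal sketch, not the paper's implementation: it assumes the GME outputs a scalar grade per pseudo-label in [0, 1], and the hypothetical function and argument names (graded_distillation_loss, quality_grade) are introduced here only for illustration.

```python
import torch
import torch.nn.functional as F


def graded_distillation_loss(student_logits: torch.Tensor,
                             pseudo_mask: torch.Tensor,
                             quality_grade: torch.Tensor) -> torch.Tensor:
    """Illustrative quality-weighted distillation loss (assumed form).

    student_logits: (B, 1, H, W) raw student predictions
    pseudo_mask:    (B, 1, H, W) teacher pseudo-labels in [0, 1]
    quality_grade:  (B,) per-image grade in [0, 1] from a mask evaluator
    """
    # Pixel-level term: per-image BCE between student prediction and pseudo-label.
    pixel_loss = F.binary_cross_entropy_with_logits(
        student_logits, pseudo_mask, reduction="none"
    ).mean(dim=(1, 2, 3))  # shape (B,)

    # Image-level term: weight each image's loss by its pseudo-label grade,
    # so higher-quality pseudo-labels contribute more supervision.
    return (quality_grade * pixel_loss).mean()
```

In this sketch the grade acts as a soft gate: a zero grade removes an unreliable pseudo-label from training, while intermediate grades scale its influence rather than discarding it outright.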