CDICS: Delving Into Fine-Grained Attribute for In-Context Segmentation via Compositional Prompts and Phased Decoupling
Abstract
In-Context Learning (ICL) has shown great effectiveness in developing generalist image segmentation models. Its significant advantage over text-based descriptions is the ability to convey intricate visual appearance details through simple reference images. However, finding a single, perfectly matching example for rare and complex real-world concepts is difficult, and existing methods are largely confined to semantic- or instance-level understanding of the reference image, struggling to express more precise segmentation needs through the input. To address this, we propose \textbf{CDICS}, a novel framework that leverages \textbf{C}ompositional prompts and phased task \textbf{D}ecoupling to achieve compositional prompt-controlled \textbf{I}n-\textbf{C}ontext \textbf{S}egmentation. Our method introduces compositional prompts derived from reference prompts, combining semantic, part, and color images to dynamically define segmentation targets. To effectively fuse this control information, ensure synergy while suppressing interference, and mitigate the risk of feature coupling, we design a decoupled two-stage architecture that first performs coarse-grained semantic localization and then refines the result using compositional appearance prompts to precisely match the specified attributes. This design extends traditional in-context segmentation to support compositional prompts. Additionally, we reconstruct two datasets and their benchmarks to acquire data with part- and color-specific attributes. Our method demonstrates superior performance on the compositional prompt-controlled in-context segmentation task; it also extends the capabilities of existing in-context segmentation and takes a step toward real-world fine-grained segmentation.