Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision
Abstract
Conversational image segmentation grounds abstract, intent-driven concepts in pixel-accurate masks. Prior work on referring image segmentation focuses on categorical and spatial queries (\eg, “left-most apple”) and overlooks functional and physical reasoning (\eg, “where can I safely store the knife?”). To address this gap, we introduce Conversational Image Segmentation (CIS) and ConvSeg, a benchmark spanning entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. We also present ConvSeg-Net, a model that fuses strong segmentation priors with language understanding, and an AI-powered data engine that generates prompt–mask pairs without human supervision. We show that current language-guided segmentation models are inadequate for CIS, while ConvSeg-Net, trained on data from our engine, achieves significant gains on ConvSeg and maintains strong performance on existing language-guided segmentation benchmarks.