CG-Reasoner: Centroid-Guided Positional Reasoning Segmentation for Medical Imaging with a Robust Visual-Text Consistency Metric
Abstract
Accurate and interpretable medical image segmentation remains a major challenge: existing deep learning models primarily optimize pixel-level accuracy while overlooking positional reasoning, an essential component of automated report generation and clinical interpretability. We introduce CG-Reasoner, a novel centroid-guided cross-modal framework that jointly performs medical image segmentation and positional reasoning. CG-Reasoner integrates a multimodal large language model (LLM), a newly designed lightweight encoder–decoder architecture, and a Text2Centroid module that predicts lesion centroids from reasoning embeddings, enabling the model to produce both accurate segmentation masks and spatially coherent, clinically meaningful reasoning explanations. Furthermore, we propose PRScore (Positional-Reasoning Score), a robust evaluation metric that jointly measures the spatial and semantic alignment between generated reasoning text and segmentation masks. Experiments on six medical datasets spanning multiple imaging modalities demonstrate that CG-Reasoner achieves state-of-the-art performance, offering precise segmentation, spatially coherent reasoning, and clinically interpretable visual-textual explanations within a unified framework. The source code is available at https://github.com/lpmm2025/CG-Reasoner.
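To make the Text2Centroid idea concrete, the following is a minimal sketch, not the paper's implementation: it assumes the module can be approximated by a small regression head (a hypothetical two-layer MLP with an assumed embedding width of 256) that maps a reasoning embedding to a normalized (x, y) lesion centroid in image coordinates.

```python
import torch
import torch.nn as nn


class Text2Centroid(nn.Module):
    """Hypothetical sketch of a Text2Centroid-style head.

    Maps a reasoning embedding to a lesion centroid normalized
    to the unit square; dimensions here are illustrative assumptions.
    """

    def __init__(self, embed_dim: int = 256, hidden_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),
            nn.Sigmoid(),  # constrain the (x, y) centroid to [0, 1]^2
        )

    def forward(self, reasoning_embedding: torch.Tensor) -> torch.Tensor:
        # reasoning_embedding: (batch, embed_dim) -> (batch, 2)
        return self.mlp(reasoning_embedding)


head = Text2Centroid(embed_dim=256)
centroid = head(torch.randn(4, 256))  # batch of 4 reasoning embeddings
```

Such a head could be trained with an L2 loss against ground-truth mask centroids, which is one plausible way to couple the reasoning text to the spatial prediction.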