VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation
Abstract
We introduce VoxTell, a vision–language model for text-prompted volumetric medical image segmentation. It maps free-form descriptions, from single words to full clinical sentences, to 3D masks. Trained on 62K+ CT, MRI, and PET volumes spanning 1K+ anatomical and pathological classes, VoxTell uses multi-stage vision–language fusion across decoder layers to align textual and visual features at multiple scales. It achieves state-of-the-art zero-shot performance on unseen datasets across modalities, excelling on familiar concepts while generalizing to related unseen classes. Extensive experiments further demonstrate strong cross-modality transfer, robustness to linguistic variation and clinical language, and accurate instance-specific segmentation from real-world text. Code and model will be published at: www.github.com/anonymous
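The multi-stage vision–language fusion mentioned above can be illustrated with a minimal sketch: a single text embedding conditions decoder feature maps at several spatial scales. This is an illustrative FiLM-style modulation in NumPy, not the paper's actual architecture; all names, dimensions, and the choice of modulation are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse(visual, text_emb, w_scale, w_shift):
    """Condition a 3D feature map (C, D, H, W) on a text embedding.

    FiLM-style modulation chosen purely for illustration: the text
    embedding is projected to per-channel scale and shift terms.
    """
    scale = np.tanh(text_emb @ w_scale)  # (C,)
    shift = np.tanh(text_emb @ w_shift)  # (C,)
    return visual * (1.0 + scale[:, None, None, None]) + shift[:, None, None, None]

# Toy decoder features at three scales: (channels, depth, height, width).
text_dim = 16
features = [rng.standard_normal((c, s, s, s)) for c, s in [(32, 4), (16, 8), (8, 16)]]
text_emb = rng.standard_normal(text_dim)

fused = []
for f in features:
    c = f.shape[0]
    # Per-scale projection weights (would be learned in a real model).
    w_scale = 0.1 * rng.standard_normal((text_dim, c))
    w_shift = 0.1 * rng.standard_normal((text_dim, c))
    fused.append(fuse(f, text_emb, w_scale, w_shift))

# Fusion preserves spatial resolution at every decoder stage.
for f, g in zip(features, fused):
    assert f.shape == g.shape
```

In a trained model the projections would be learned jointly with the decoder, so each scale can attend to different aspects of the prompt; this sketch only shows the data flow.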