SkySense-VITA: Towards Universal In-context Segmentation of Multi-modal Remote Sensing Imagery
Abstract
While recent foundation models for remote sensing (RS) segmentation have made notable progress, they still struggle to process diverse multi-modal inputs, to synergize complementary prompt types, and to exploit semantic hierarchies. To address these limitations, we introduce SkySense-VITA, a unified in-context segmentation model that processes both optical and SAR imagery using visual, textual, or fused prompts. Built on a novel prompt-and-prediction decoupling strategy, the proposed VITA-Former and VITA-Decoder separate multi-modal prompt fusion from the prediction process, allowing the model to flexibly support visual-only, textual-only, and fused prompt modes. We train SkySense-VITA with a progressive two-stage strategy: a first stage of Image-Level Alignment Pretraining featuring optical-SAR alignment, and a second stage of Pixel-Level In-context Pretraining using Semantic Granularity Annealing (SGA), a coarse-to-fine curriculum that enables robust hierarchical learning. To support this training, we also introduce Sky-VT-300k, a new large-scale multi-modal dataset. Extensive experiments show that SkySense-VITA establishes a new state-of-the-art (SOTA) on 18 datasets, with an average performance lead of over 10\% mIoU. Code, models, and data will be released upon acceptance.