RECS4R: Bridging Semantics and Geometry for Referring Remote Sensing Interpretation
Abstract
The referring expression comprehension and segmentation (RECS) task plays a vital role in remote sensing owing to its efficiency in multi-task interpretation. However, RECS has reached a performance bottleneck rooted in representational insufficiency, stemming primarily from cross-task representational fragmentation in multi-task interpretation. In this paper, we propose RECS4R, a unified multi-task framework that improves RECS performance. At the representation level, we introduce a language-guided unified contour decoding paradigm (LCUDP) that takes the language-conditioned contour as an intermediate carrier to decode referring expression comprehension (REC) and referring image segmentation (RIS) synchronously, structurally preserving geometric and semantic consistency and enabling lightweight, efficient decoding. At the refinement level, we introduce residual coarse-to-fine encoding (RCE), shifting the fine stage from learning from scratch to error correction. At the reaggregation level, we design channel-isolated multi-scale fusion (CIMF) to achieve lossless feature fusion. At the regularization level, we employ a gradient consistency loss (GCL) to enhance LCUDP and improve boundary adherence. We validate RECS4R on remote-sensing and natural-image datasets, including RefDIOR, RRSIS-D, OPT-RSVG, RefCOCO, RefCOCO+, and RefCOCOg, and evaluate the image encoder with CNN, Transformer, and Mamba backbones, achieving strong performance. The code will be released soon.