ORSATR-X: A Foundation Model based on Differential-and-Excitation Networks for Optical Remote Sensing Object Recognition
Abstract
Recent advances in Remote Sensing Foundation Models (RSFMs) have demonstrated considerable potential for Earth Observation (EO) tasks. While adapting natural image foundation models (e.g., DINO) offers a data-efficient strategy for building RSFMs, their strong generalization capability does not fully transfer to complex remote sensing (RS) scenarios due to severe background interference, notably in perceiving challenging targets such as low-contrast objects. To this end, we propose ORSATR-X, a novel RSFM that integrates the generalizable representations of DINOv3 with a dedicated mechanism for exciting local contrast information. ORSATR-X comprises two core components: (1) a DINOv3 encoder, which provides rich feature representations under limited RS pretraining data, and (2) a carefully designed side network incorporating a Weber Local Adapter (WLA) and a Multi-scale Aggregation Module (MSAM). The WLA enhances the discriminability of low-contrast boundaries in complex scenes by strengthening center-surround contrast and directional gradient information, while the MSAM handles the inherent object scale variations in RS imagery by adaptively aggregating features across multiple scales. Furthermore, we pretrain the side network using an efficient self-supervised distillation strategy. Extensive experiments on scene classification, object detection, and semantic segmentation show that ORSATR-X achieves state-of-the-art performance among existing RSFMs, validating the effectiveness of our design.
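The "center-surround contrast" enhancement in the WLA builds on Weber's law, which underlies the classical Weber Local Descriptor: the perceptual salience of an intensity change depends on the ratio of the local difference to the background intensity. As a rough illustration only (this NumPy sketch shows the classical Weber differential excitation, not the paper's actual WLA implementation; the function name and epsilon stabilizer are our own choices), the per-pixel excitation can be computed as the arctangent of the summed neighbor differences normalized by the center value:

```python
import numpy as np

def weber_differential_excitation(img, eps=1e-6):
    """Classical Weber differential excitation (illustrative sketch).

    For each pixel x_c with 8-neighborhood {x_i}:
        xi = arctan( sum_i (x_i - x_c) / (x_c + eps) )
    Normalizing the difference by the center intensity makes the
    response sensitive to low-contrast boundaries, which dim regions
    of RS imagery would otherwise suppress.
    """
    img = img.astype(np.float64)
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")  # replicate borders
    diff_sum = np.zeros_like(img)
    # Accumulate (x_i - x_c) over the 8 neighbors via shifted views.
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            diff_sum += padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w] - img
    return np.arctan(diff_sum / (img + eps))
```

On a flat region the excitation is zero, while a bright pixel surrounded by darker neighbors yields a strongly negative response; the arctangent bounds the output, keeping the map stable for downstream feature fusion.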