GeoCoT: Towards Reliable Remote Sensing Reasoning with Manifold Perspective
Daixun Li ⋅ Zirui Li ⋅ Sibo He ⋅ Jiayun Tian ⋅ Mingxiang Cao ⋅ Weiying Xie ⋅ Yunke Wang ⋅ Xin Zhang ⋅ Yusi Zhang ⋅ Yunsong Li ⋅ Chang Xu ⋅ Leyuan Fang
Abstract
Multimodal Large Language Models (MLLMs) have shown strong potential in remote sensing (RS) through multi-task reasoning and cross-modal generalization. However, existing RS-MLLMs mainly rely on a single shared expert for all tasks, making it hard to produce reliable results. Meanwhile, the intrinsic redundancy and homogeneity of RS images pose substantial difficulties for both training and inference. These challenges directly conflict with the demands of remote sensing, which values task precision and trustworthy reasoning. To address these limitations, we propose GeoCoT, a manifold-driven mixture-of-experts (MoE) system with Chain-of-Thought (CoT) reasoning. GeoCoT introduces Mani-MoE, a sparse expert architecture grounded in local manifold mapping. It adaptively projects high-dimensional tokens onto low-rank subspaces to eliminate redundancy and uncover intrinsic structure, and then routes them through a sparse expert pathway in which gating decisions are guided by the manifold structure of the input. To optimize this architecture, we adopt a CoT-driven multi-stage training strategy: a cold-start phase for domain adaptation, followed by our RS Vision Group Relative Policy Optimization (RSV-GRPO), which systematically strengthens structured reasoning from global context down to specific objectives. Furthermore, we build the *RS-CoT-20k* dataset for task-specific supervision. Extensive experiments on multi-task datasets demonstrate that GeoCoT outperforms prior approaches, achieving $5.27\%$ higher average accuracy than the state-of-the-art method. Our code will be made available.
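To make the Mani-MoE description concrete, below is a minimal PyTorch sketch of the two steps named in the abstract: an adaptive low-rank (subspace) projection of tokens, followed by sparse top-k expert routing whose gate reads the projected coordinates. This is an illustrative sketch, not the paper's implementation; the class name `ManiMoESketch` and all hyperparameters (`dim`, `rank`, `num_experts`, `top_k`) and layer shapes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ManiMoESketch(nn.Module):
    """Sketch of manifold-guided sparse MoE: low-rank projection, then
    top-k expert routing with the gate conditioned on the projected tokens.
    All sizes and module choices are illustrative assumptions."""

    def __init__(self, dim=768, rank=64, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Low-rank mapping: project tokens onto an r-dimensional subspace
        # to strip redundancy, then lift back to model width.
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        # Gate sees the low-rank coordinates, so routing is guided by the
        # subspace ("manifold") structure rather than raw tokens.
        self.gate = nn.Linear(rank, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (batch, tokens, dim)
        z = self.down(x)                        # subspace coordinates
        x = self.up(z)                          # redundancy-reduced tokens
        logits = self.gate(z)                   # manifold-guided routing scores
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize over selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e         # tokens sent to expert e at slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

The dense double loop over experts is for readability; a real implementation would dispatch tokens with scatter/gather for efficiency.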
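Similarly, RSV-GRPO builds on GRPO-style policy optimization, whose core step is a group-relative advantage: each sampled response's reward is normalized against the mean and standard deviation of its own sampling group. The sketch below shows only that generic step; the paper's actual reward design and any RS-specific vision rewards are not reproduced here, and `eps` is an assumed stabilizer.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages for a batch of prompts.

    rewards: (num_prompts, group_size) scalar rewards, one row per prompt,
             one column per sampled response in that prompt's group.
    Returns advantages of the same shape, standardized within each group.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)
```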