Beyond What's Shared: Recovering Lost Unique Information from Intermediate Layers to Boost Multimodal Geo-Foundation Models
Abstract
Learning general-purpose representations of geographic locations has become essential to geospatial tasks such as population estimation and environmental monitoring. To obtain such representations, multimodal geo-foundation models often use contrastive learning (CL) to align satellite imagery with geo-coordinates, implicitly assuming that cross-modal (shared) information suffices for downstream tasks. However, not all task-relevant information is shared between modalities, and retaining modality-specific (unique) features can improve task performance. Prior methods retain unique information through extra training objectives or databases, increasing training complexity and computation. Motivated by the conventional wisdom that earlier layers capture general input features while later layers become task-specific, we hypothesize that early layers in CL models contain unique information that is lost toward the final layer. Through a comprehensive layerwise analysis of the modality gap, representation similarity, and mutual information, we confirm this trend and find that fusing intermediate (more unique) and final (more shared) representations outperforms state-of-the-art models across diverse geospatial benchmarks. Our findings reveal underutilized information diversity in CL models and show that simple layerwise fusion is an efficient path to richer geo-embeddings.