PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation
Abstract
Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data introduce fundamentally different challenges, including oblique viewpoints, ultra-high resolution, and extreme scale variation. To address these UAV-specific conditions, we formally define the UAV Reasoning Segmentation task and organize its semantic demands along three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, the first large-scale UAV reasoning segmentation benchmark, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision covering all three reasoning types. We further propose PixDLM, a pixel-level multimodal language model equipped with a Dual-Path Vision Encoder that preserves fine-grained high-resolution cues while maintaining strong global semantic alignment. Extensive experiments on DRSeg demonstrate that PixDLM achieves superior semantic consistency and spatial localization accuracy compared with existing multimodal models, offering a unified and efficient baseline for UAV reasoning segmentation. All datasets, models, and code will be released.
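The abstract describes the Dual-Path Vision Encoder only at a high level: one path keeps fine-grained high-resolution detail while the other provides globally aligned semantics. The following NumPy sketch illustrates that dual-path idea in the simplest possible terms; it is not the paper's actual architecture, and every function name, pooling choice, and shape here (`dual_path_encode`, block-average pooling, 256x256 input, 32-pixel patches) is a hypothetical stand-in for the real encoder components.

```python
import numpy as np

def avg_pool(img, factor):
    """Block-average downsampling: (H, W, C) -> (H//factor, W//factor, C)."""
    h, w, c = img.shape
    return img.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def dual_path_encode(img, global_factor=8, patch=32):
    """Hypothetical dual-path split: a heavily downsampled global path for
    scene-level semantics, plus a patch-wise local path that summarizes
    full-resolution detail, fused into one token grid."""
    # Global path: coarse semantic map, pooled to a single descriptor.
    g = avg_pool(img, global_factor)       # e.g. (32, 32, C)
    g_vec = g.mean(axis=(0, 1))            # global descriptor, shape (C,)

    # Local path: tile the full-resolution image into patches and
    # summarize each patch (a real encoder would use learned features).
    h, w, c = img.shape
    patches = img.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (gh, gw, patch, patch, C)
    local = patches.mean(axis=(2, 3))            # per-patch features, (gh, gw, C)

    # Fusion: broadcast the global descriptor onto every local token so each
    # token carries both fine-grained and globally aligned information.
    fused = np.concatenate([local, np.broadcast_to(g_vec, local.shape)], axis=-1)
    return fused                                 # (gh, gw, 2C)

img = np.random.rand(256, 256, 3)
tokens = dual_path_encode(img)
print(tokens.shape)  # (8, 8, 6)
```

The fusion step is where the two paths meet: each of the 8x8 local tokens is concatenated with the same global descriptor, so downstream reasoning can attend to local detail without losing scene-level context.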