PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation
Abstract
Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data introduce fundamentally different challenges, including oblique viewpoints, ultra-high resolution, and extreme scale variation. To address these UAV-specific conditions, we formally define the UAV Reasoning Segmentation task and organize its semantic demands along three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, the first large-scale UAV reasoning segmentation benchmark, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision covering all three reasoning types. We further propose PixDLM, a pixel-level multimodal language model equipped with a Dual-Path Vision Encoder that preserves fine-grained high-resolution cues while maintaining strong global semantic alignment. Extensive experiments on DRSeg demonstrate that PixDLM achieves superior semantic consistency and spatial localization accuracy compared with existing multimodal models, offering a unified and efficient baseline for UAV reasoning segmentation. All datasets, models, and code will be released.
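The abstract describes the Dual-Path Vision Encoder only at a high level: one path keeps fine-grained high-resolution detail while the other provides globally aligned semantics. The following NumPy sketch illustrates that dual-path idea in the simplest possible terms; it is not the paper's actual architecture, and every function name, pooling choice, and shape here (`dual_path_encode`, block-average pooling, 256x256 input, 32-pixel patches) is a hypothetical stand-in for the real encoder components.

```python
import numpy as np

def avg_pool(img, factor):
    """Block-average downsampling: (H, W, C) -> (H//factor, W//factor, C)."""
    h, w, c = img.shape
    return img.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def dual_path_encode(img, global_factor=8, patch=32):
    """Hypothetical dual-path split: a heavily downsampled global path for
    scene-level semantics, plus a patch-wise local path that summarizes
    full-resolution detail, fused into one token grid."""
    # Global path: coarse semantic map, pooled to a single descriptor.
    g = avg_pool(img, global_factor)       # e.g. (32, 32, C)
    g_vec = g.mean(axis=(0, 1))            # global descriptor, shape (C,)

    # Local path: tile the full-resolution image into patches and
    # summarize each patch (a real encoder would use learned features).
    h, w, c = img.shape
    patches = img.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (gh, gw, patch, patch, C)
    local = patches.mean(axis=(2, 3))            # per-patch features, (gh, gw, C)

    # Fusion: broadcast the global descriptor onto every local token so each
    # token carries both fine-grained and globally aligned information.
    fused = np.concatenate([local, np.broadcast_to(g_vec, local.shape)], axis=-1)
    return fused                                 # (gh, gw, 2C)

img = np.random.rand(256, 256, 3)
tokens = dual_path_encode(img)
print(tokens.shape)  # (8, 8, 6)
```

The fusion step is where the two paths meet: each of the 8x8 local tokens is concatenated with the same global descriptor, so downstream reasoning can attend to local detail without losing scene-level context.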