DyFCLT: Dynamic Frequency-Decoupled Cross-Modal Learning Transformer for Multimodal Tiny Object Detection
Abstract
Multimodal tiny object detection plays a critical role in real-world applications. However, detecting tiny objects remains challenging due to their small scale and complex environments. While recent methods leverage spatial multi-scale representations or frequency-domain enhancements, most focus solely on visible images and overlook complementary multimodal frequency cues. This paper explores how to effectively harness cross-modal frequency information for infrared–visible tiny object detection. Through frequency characteristic analysis, we observe that tiny objects exhibit rich mid- and high-frequency energy in both modalities, motivating the design of a Dynamic Frequency-Decoupled Cross-Modal Learning Transformer (DyFCLT). Our approach introduces a Dynamic Frequency-Band Decoupled Cross-Modal Attention (DFCA) mechanism to extract frequency components and enable their interaction across modalities. To suppress noise while enhancing foreground signals, we propose a Selective Smoothing Enhancement (SSE) strategy, which smooths background interference and guides multi-scale feature fusion. DFCA and SSE work in concert to enrich and refine cross-modal features. Extensive experiments on two tiny-object benchmarks and one general-scale benchmark demonstrate that DyFCLT achieves new state-of-the-art results, outperforming prior leading methods by significant margins and generalizing well across object scales and scenarios.