AIMDepth: Asymmetric Image-Event Mamba for Monocular Depth Estimation
Abstract
Monocular depth estimation is critical for applications such as autonomous driving and robotics. The complementary properties of the event and image modalities motivate fusion-based methods for robust depth estimation. However, existing fusion methods rely on convolutional or attention-based architectures, which either struggle to capture global dependencies or incur high computational costs, limiting their suitability for long-sequence modeling in depth tasks. Moreover, effective image-event fusion remains a key challenge, since most existing methods fuse features directly without addressing the domain gap and the representational differences between raw events and images, leading to semantic bias and degraded performance. In this work, we propose AIMDepth, an Asymmetric Image-Event Mamba framework for monocular depth estimation, built entirely on state space models to ensure linear computational complexity and accurate prediction. To address input-domain misalignment, we introduce a Spectral Cross-modal Prior Guidance (SCPG) module that performs bidirectional prior injection at the input level. To mitigate the representational imbalance between sparse events and dense images, we design an Asymmetric Modal-aware Encoder (AME) that allocates a separate encoding path to each modality and facilitates feature-level alignment tailored to their distinct information densities. To further enhance fusion, we develop a Modality-interactive Local Refinement (ModiLocal) module that enables hierarchical interaction and fine-grained alignment through SSM-based modeling. Extensive experiments on public datasets demonstrate that AIMDepth achieves state-of-the-art performance and strong robustness in complex environments.
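To make the asymmetric design concrete, the following is a minimal illustrative sketch, not the authors' implementation: a toy diagonal state space (SSM) block with linear-time sequence mixing, and an encoder with separate, unequally sized paths for dense images and sparse event voxels. All module names (ToySSMBlock, AsymmetricEncoder), channel sizes, the 5-channel event voxel input, and the simple concatenation-based fusion (standing in for SCPG/ModiLocal) are assumptions for illustration only.

```python
# Illustrative sketch (assumed shapes and names), not the AIMDepth code.
import torch
import torch.nn as nn


class ToySSMBlock(nn.Module):
    """Diagonal linear state space scan over a flattened token sequence.

    Per channel: h_t = a * h_{t-1} + b * x_t,  y_t = c * h_t,
    giving linear-time (O(L)) sequence mixing instead of attention.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.a = nn.Parameter(torch.full((dim,), 0.9))  # state decay
        self.b = nn.Parameter(torch.ones(dim))          # input gain
        self.c = nn.Parameter(torch.ones(dim))          # output gain
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                    # x: (B, L, C)
        h = torch.zeros_like(x[:, 0])
        ys = []
        for t in range(x.shape[1]):           # sequential scan
            h = self.a * h + self.b * x[:, t]
            ys.append(self.c * h)
        return self.proj(torch.stack(ys, dim=1)) + x   # residual connection


class AsymmetricEncoder(nn.Module):
    """Separate encoding paths for dense RGB frames and sparse event voxels."""

    def __init__(self, img_dim: int = 64, evt_dim: int = 32):
        super().__init__()
        self.img_embed = nn.Conv2d(3, img_dim, 4, stride=4)   # dense image path
        self.evt_embed = nn.Conv2d(5, evt_dim, 4, stride=4)   # sparse event path (5-bin voxel grid assumed)
        self.img_ssm = ToySSMBlock(img_dim)
        self.evt_ssm = ToySSMBlock(evt_dim)
        self.fuse = nn.Linear(img_dim + evt_dim, img_dim)     # placeholder for the fusion module

    def forward(self, image, events):
        fi = self.img_embed(image).flatten(2).transpose(1, 2)   # (B, L, C_img)
        fe = self.evt_embed(events).flatten(2).transpose(1, 2)  # (B, L, C_evt)
        fi, fe = self.img_ssm(fi), self.evt_ssm(fe)
        return self.fuse(torch.cat([fi, fe], dim=-1))            # fused tokens for a depth decoder
```

The asymmetry here is only in path width (img_dim vs. evt_dim); the paper's AME, SCPG, and ModiLocal modules perform the actual prior injection and hierarchical alignment described in the abstract.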