Depth Any Endoscopy: Towards Self-Supervised Generalizable Depth Estimation in Monocular Endoscopy
Abstract
Monocular depth estimation serves as a core technique in endoscopic applications such as 3D reconstruction and localization. However, most existing methods focus on in-domain depth estimation, which limits their robustness: cross-domain performance degrades under variations in depth distributions, illumination conditions, and texture patterns. In this work, we propose Depth Any Endoscopy (DAE), a novel self-supervised framework for generalizable depth estimation in monocular endoscopy. Specifically, we develop a dual-level Mixture-of-Experts (MoE) adaptation paradigm that effectively tailors Vision Foundation Models to diverse endoscopic procedures, such as laparoscopy and colonoscopy, addressing the challenges posed by their varying environments. Internally, we integrate LoRA and Adapter modules within the MoE architecture, allowing the model to flexibly adapt to the characteristics of the input data. Externally, a mixture of domain-specific experts provides customized guidance that enhances training stability. In addition, we introduce a learnable gradient harmonization mechanism that dynamically balances the optimization of the depth and pose networks, along with a semantic distribution calibration module that strengthens the semantic consistency of depth predictions. Extensive experiments demonstrate that the proposed DAE achieves state-of-the-art performance in both zero-shot and in-domain depth estimation scenarios.
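To make the internal adaptation level concrete, the sketch below shows one plausible way to wire a Mixture-of-Experts LoRA layer around a frozen foundation-model linear layer in PyTorch, with a gating network routing each token to low-rank adapter experts. The class name `MoELoRALinear`, the expert count, and the rank are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the released DAE code): MoE routing over LoRA
# experts that modulate a frozen foundation-model linear layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, num_experts: int = 4, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # keep foundation weights frozen
            p.requires_grad = False
        d_in, d_out = base.in_features, base.out_features
        # One low-rank (A, B) pair per expert; B starts at zero so the
        # layer initially reproduces the frozen base output.
        self.A = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, rank, d_out))
        self.gate = nn.Linear(d_in, num_experts)  # input-conditioned routing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d_in)
        w = F.softmax(self.gate(x), dim=-1)                     # (B, T, E)
        down = torch.einsum("bti,eir->bter", x, self.A)         # project down
        delta = torch.einsum("bter,ero->bteo", down, self.B)    # project up
        delta = (w.unsqueeze(-1) * delta).sum(dim=2)            # mix experts
        return self.base(x) + delta  # frozen path + input-adaptive update
```

Under this reading, the gate lets the update vary with the input (e.g., laparoscopic vs. colonoscopic frames) while the pretrained backbone stays fixed; the zero-initialized up-projection keeps early training close to the foundation model's behavior.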