Seeing Depth Through Frequency and Motion: A Progressive Training Paradigm for Monocular Depth Estimation
Abstract
Self-supervised monocular depth estimation has achieved remarkable progress in recent years, yet frequency aliasing and the lack of fine-grained cross-frame motion modeling still lead to blurred depth boundaries and suboptimal camera motion estimation. To address these challenges, we propose a progressive self-supervised framework that integrates a Frequency-Guided Depth Network (FGDepth) and a PoseQuery Network (PQNet). FGDepth incorporates a plug-and-play Frequency-Guided Sampling module that explicitly enhances high-frequency details and suppresses aliasing artifacts, producing depth maps with sharper boundaries. PQNet employs channel-aligned attention to model fine-grained cross-frame motion features, enabling more accurate and robust camera motion estimation. Furthermore, we design a progressive three-stage decoupled training strategy that exploits the complementarity between depth and pose estimation, further improving overall performance. Extensive experiments on the KITTI benchmark demonstrate state-of-the-art performance, including a 4.1% reduction in Sq Rel over strong baselines, and our method also exhibits strong cross-dataset generalization on Make3D. Ablation studies further validate the effectiveness of each proposed component.
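To make the frequency-guidance idea concrete, the sketch below shows one plausible way a frequency-guided sampling step could be realized: split features into frequency bands with an FFT, re-weight the high-frequency band before upsampling, and refine with a light convolution. This is a minimal illustrative sketch under our own assumptions; the class name FrequencyGuidedSampling, the use of torch.fft, the blending weight alpha, and the 0.25 band cutoff are all hypothetical and are not taken from the paper's actual module.

```python
# Illustrative sketch only: the paper does not specify its Frequency-Guided
# Sampling module; this shows one plausible FFT-based high-frequency boost.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrequencyGuidedSampling(nn.Module):
    """Hypothetical plug-and-play block: upsamples a feature map while
    boosting high-frequency content to counteract aliasing-induced blur."""

    def __init__(self, channels: int, alpha: float = 0.5):
        super().__init__()
        self.alpha = alpha  # assumed blending weight (not from the paper)
        self.refine = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Move features to the frequency domain with a real 2D FFT.
        freq = torch.fft.rfft2(x, norm="ortho")
        h, w = x.shape[-2], freq.shape[-1]
        # Build a crude high-frequency mask from the per-axis frequencies.
        yy = torch.fft.fftfreq(h, device=x.device).abs().view(1, 1, h, 1)
        xx = torch.fft.rfftfreq(x.shape[-1], device=x.device).view(1, 1, 1, w)
        high_band = ((yy + xx) > 0.25).float()  # assumed band cutoff
        # Amplify the high-frequency band, leave low frequencies untouched.
        boosted = freq * (1.0 + self.alpha * high_band)
        x_hf = torch.fft.irfft2(boosted, s=x.shape[-2:], norm="ortho")
        # Upsample the enhanced features and refine with a light conv.
        up = F.interpolate(x_hf, scale_factor=2, mode="bilinear",
                           align_corners=False)
        return self.refine(up)

# Usage: fgs = FrequencyGuidedSampling(64); y = fgs(torch.randn(1, 64, 32, 32))
```

The design choice being illustrated is that sharpening happens in the frequency domain before spatial upsampling, so interpolation acts on features whose high-frequency content has already been restored rather than smoothing it away.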