SPE-MVS: Spatial Position Encoding Enhanced Multi-View Stereo with Monocular Depth Priors
Abstract
Learning-based Multi-View Stereo (MVS) methods have become mainstream in the field, relying on the construction of cost volumes through multi-view feature similarity computation and regularization. However, existing methods depend heavily on photometric consistency across views, leading to poor performance in challenging regions such as weakly textured or non-Lambertian surfaces. To overcome this limitation, we propose SPE-MVS, a novel MVS framework enhanced with Spatial Position Encoding (SPE). The SPE represents the 3D position of each pixel in every image within a unified metric space, constructed using monocular depth priors. We take the SPE as an additional input alongside the image data and introduce a Photometric-Spatial Hybrid Feature Extractor, along with an SPE-enhanced cost volume construction module. These components incorporate spatial position-based similarity computation, substantially improving robustness in challenging areas. Furthermore, we propose a Monocular Depth-guided Enhancement (MDGE) module that refines depth probability distributions using monocular depth priors, further boosting depth estimation performance. Extensive experiments demonstrate that our method significantly improves reconstruction quality in difficult regions and achieves state-of-the-art (SOTA) performance on multiple benchmarks.
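The core idea of the SPE, encoding each pixel's 3D position in a metric space derived from a monocular depth prior, can be sketched as a standard pinhole back-projection. The following is a minimal illustrative sketch, not the paper's actual module: the function name and the use of plain camera-space unprojection are assumptions for clarity.

```python
import numpy as np

def spatial_position_encoding(depth, K):
    """Back-project each pixel to a 3D point in the camera's metric space.

    Hypothetical sketch: given a monocular depth map `depth` (H, W) and
    camera intrinsics `K` (3, 3), compute a per-pixel (X, Y, Z) position
    map. The paper's actual SPE construction may differ (e.g., aligning
    the monocular depth scale across views first).
    """
    h, w = depth.shape
    # Build the homogeneous pixel coordinate grid (u, v, 1).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # (H, W, 3)
    # Unproject: X = depth * K^{-1} [u, v, 1]^T
    rays = pix @ np.linalg.inv(K).T
    return rays * depth[..., None]  # (H, W, 3) per-pixel 3D positions

# Toy usage: constant 2 m depth, simple pinhole camera with the
# principal point at pixel (2, 2).
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 2.0],
              [0.0, 0.0, 1.0]])
pts = spatial_position_encoding(np.full((4, 4), 2.0), K)
```

With real data, the resulting position map would be concatenated with the image as network input, so that feature similarity can be computed over spatial positions as well as photometric appearance.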