SO(3)-Equivariant ViT-Adapter for Data-Efficient Zero-Shot Sim-to-Real Indoor Panoramic Depth Estimation
Ziyan He ⋅ Qiudan Zhang ⋅ Lin Ma ⋅ Xu Wang
Abstract
Panoramic depth estimation enables a complete $360^\circ$ understanding of 3D environments but faces significant challenges in generalizing to real-world scenes. While recent zero-shot depth models such as Depth Anything achieve remarkable generalization on perspective images, their performance degrades sharply on panoramas due to projection distortions and the lack of spherical geometric awareness. Moreover, collecting large-scale panoramic RGB-D data is costly, hindering the training of panoramic foundation models at scale. To address these issues, we propose an SO(3)-Equivariant ViT-Adapter, which transfers the powerful zero-shot capability of a perspective pre-trained ViT to panoramic depth estimation by explicitly incorporating a rotation-equivariant inductive bias. Our adapter introduces an SO(3) deformable cross-attention mechanism to effectively align SO(3)-equivariant features with perspective features, enhancing rotational consistency without modifying the ViT backbone. Trained solely on synthetic panoramas, our framework achieves robust zero-shot sim-to-real performance on real indoor benchmarks, including Matterport3D and Stanford2D3D, demonstrating both data efficiency and strong generalization for panoramic depth estimation.
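To make the rotation-equivariant inductive bias concrete: on an equirectangular panorama, a yaw rotation of the camera corresponds to a circular shift along the longitude axis, so an equivariant operator must commute with that shift. The following is a minimal illustrative sketch (not the paper's actual adapter): a toy convolution with circular padding along longitude, whose output for a rotated panorama equals the rotated output for the original panorama.

```python
import numpy as np

def circular_conv_lon(feat, kernel):
    """Toy 1D convolution along the longitude axis of an
    equirectangular feature map, with circular (wrap-around)
    padding so the operator commutes with yaw rotations.

    feat:   (H, W) feature map, W = longitude axis
    kernel: (k,) 1D filter, k odd
    """
    H, W = feat.shape
    k = len(kernel)
    pad = k // 2
    # Circular padding: longitude wraps around the sphere.
    padded = np.concatenate([feat[:, -pad:], feat, feat[:, :pad]], axis=1)
    out = np.zeros_like(feat)
    for i in range(k):
        out += kernel[i] * padded[:, i:i + W]
    return out

# Yaw rotation = circular shift along longitude; the circularly
# padded filter commutes with it (equivariance).
pano = np.random.rand(4, 8)
smooth = np.array([0.25, 0.5, 0.25])
shift = 3
rotated_then_filtered = circular_conv_lon(np.roll(pano, shift, axis=1), smooth)
filtered_then_rotated = np.roll(circular_conv_lon(pano, smooth), shift, axis=1)
print(np.allclose(rotated_then_filtered, filtered_then_rotated))
```

This toy example covers only yaw rotations (one subgroup of SO(3)); full SO(3) equivariance, as targeted by the proposed adapter, additionally requires handling pitch and roll, where equirectangular projection distortion makes a simple shift insufficient.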