VGA: Empowering Aerial-Ground Localization by Visual Geometry Alignment
Abstract
Aerial-ground visual localization is challenging due to the large differences in scene scale and viewpoint between the two views. In this work, we explore the practical benefit of jointly learning camera calibration and bird’s-eye-view (BEV) projection for estimating the full six-degree-of-freedom (6-DoF) relative camera pose between uncalibrated aerial and ground views. We present Visual Geometry Alignment (VGA), a unified framework that jointly learns a global gravity-alignment prior, inferred from dense monocular perspective fields, and a planar alignment prior that recovers the otherwise unobserved azimuth angle through Procrustes alignment in a shared BEV plane. At inference, we refine the relative camera pose by integrating the predicted per-camera gravity alignment with the relative planar azimuth angle, improving both orientation and translation estimates from visual input with extremely wide baselines and limited overlap. We evaluate our method on challenging MatrixCity, ACC-NVS1, and ULTRRA ground-aerial pairs, demonstrating that optimizing with the learned geometric priors consistently improves camera pose estimation across diverse altitudes and environments.
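The planar azimuth recovery via Procrustes alignment mentioned above can be illustrated with a minimal 2D sketch. This is not the paper's implementation; it only shows the standard closed-form orthogonal Procrustes (Kabsch) solve, assuming two corresponding point sets have already been projected into a shared BEV plane:

```python
import numpy as np

def planar_procrustes_azimuth(P, Q):
    """Closed-form 2D Procrustes: find rotation R with R @ p_i ~ q_i.

    P, Q: (N, 2) corresponding points in the shared BEV plane.
    Returns (azimuth_angle_radians, R). Translation is removed by centering.
    """
    P0 = P - P.mean(axis=0)
    Q0 = Q - Q.mean(axis=0)
    # Cross-covariance and SVD (Kabsch algorithm)
    H = P0.T @ Q0
    U, _, Vt = np.linalg.svd(H)
    # Guard against reflections so R is a proper rotation
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    # Relative planar azimuth angle encoded by the 2D rotation
    angle = np.arctan2(R[1, 0], R[0, 0])
    return angle, R
```

In practice the aligned point sets would come from the learned BEV projections of the aerial and ground views; here they are assumed given as plain arrays.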