VGA: Empowering Aerial-Ground Localization by Visual Geometry Alignment
Abstract
Aerial-ground visual localization is challenging due to the large differences in scene scale and viewpoint between the two views. In this work, we explore the practical benefit of jointly learning camera calibration and bird’s-eye-view (BEV) projection for estimating the full six-degree-of-freedom (6-DoF) relative camera pose between uncalibrated aerial and ground views. We present Visual Geometry Alignment (VGA), a unified framework that jointly learns a global gravity-alignment prior, inferred from dense monocular perspective fields, and a planar alignment prior that recovers the otherwise unobserved azimuth angle through Procrustes alignment in a shared BEV plane. At inference, we refine the relative camera pose by integrating the predicted per-camera gravity alignment with the relative planar azimuth angle, improving both orientation and translation estimates from visual input with extremely wide baselines and limited overlap. We evaluate our method on challenging MatrixCity, ACC-NVS1, and ULTRRA ground-aerial pairs, demonstrating that optimizing with the learned geometric priors consistently improves camera pose estimation across diverse altitudes and environments.
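The planar azimuth recovery via Procrustes alignment mentioned above can be illustrated with a minimal 2D sketch. This is not the paper's implementation; it only shows the standard closed-form orthogonal Procrustes (Kabsch) solve, assuming two corresponding point sets have already been projected into a shared BEV plane:

```python
import numpy as np

def planar_procrustes_azimuth(P, Q):
    """Closed-form 2D Procrustes: find rotation R with R @ p_i ~ q_i.

    P, Q: (N, 2) corresponding points in the shared BEV plane.
    Returns (azimuth_angle_radians, R). Translation is removed by centering.
    """
    P0 = P - P.mean(axis=0)
    Q0 = Q - Q.mean(axis=0)
    # Cross-covariance and SVD (Kabsch algorithm)
    H = P0.T @ Q0
    U, _, Vt = np.linalg.svd(H)
    # Guard against reflections so R is a proper rotation
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    # Relative planar azimuth angle encoded by the 2D rotation
    angle = np.arctan2(R[1, 0], R[0, 0])
    return angle, R
```

In practice the aligned point sets would come from the learned BEV projections of the aerial and ground views; here they are assumed given as plain arrays.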