Sky2Ground: A Benchmark for Site Modeling under Varying Altitude
Zengyan Wang ⋅ Sirshapan Mitra ⋅ Rajat Modi ⋅ Hui Xian Grace Lim ⋅ Yogesh Rawat
Abstract
In this work, we propose the problem of localizing cameras and producing renders of a scene, given multiple images captured from ground, aerial, and satellite viewpoints. We introduce a dataset called Sky2Ground, which contains synthetic and real images across all three viewpoints, along with camera parameters and dense depth maps and surface normals. Recent works have shown that transformer-based networks like VGGT can infer scene parameters in a single forward pass. However, we show that naively fine-tuning such models degrades performance, and that this cannot be remedied by brute-force scaling. We identify satellite images as the culprit: they inject excessive noise into the learning process. We therefore propose SkyNet to enable learning from satellite images. SkyNet is a two-stream neural network, with one stream explicitly processing satellite images and another processing all viewpoints together. We propose a restricted-attention mechanism, termed `Masked-Satellite-Attention', which prevents ground and aerial images from interacting with satellite images. Further, SkyNet is optimized with strategies inspired by curriculum learning: sampling cameras that are far apart from each other during training. Extensive experiments on our Sky2Ground dataset reveal that SkyNet outperforms existing methods by $23\%$ in absolute performance. Our dataset and code will be made publicly available on Hugging Face.
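To make the restricted-attention idea concrete, below is a minimal PyTorch sketch of a view-grouped attention mask in the spirit of Masked-Satellite-Attention. This is an illustrative assumption, not the authors' released implementation: the function name `masked_satellite_attention` and the per-token `is_satellite` labels are hypothetical, and the paper's actual mask may differ (for instance, it could block interaction in only one direction).

```python
import torch
import torch.nn.functional as F

def masked_satellite_attention(q, k, v, is_satellite):
    """Attention restricted by viewpoint group (illustrative sketch).

    q, k, v:       (B, H, N, D) query/key/value tensors over image tokens.
    is_satellite:  (B, N) boolean flags, True for tokens from satellite views.

    Tokens may only attend within their own group, so ground/aerial tokens
    never interact with (noisy) satellite tokens, and vice versa.
    """
    # True where query token i and key token j come from the same group.
    same_group = is_satellite[:, :, None] == is_satellite[:, None, :]  # (B, N, N)
    mask = same_group[:, None, :, :]  # broadcast over heads: (B, 1, N, N)
    # Boolean attn_mask: True entries are allowed to participate in attention.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

# Toy usage: 6 tokens, the first two coming from a satellite image.
B, H, N, D = 1, 4, 6, 16
q = k = v = torch.randn(B, H, N, D)
views = torch.tensor([[True, True, False, False, False, False]])
out = masked_satellite_attention(q, k, v, views)  # (1, 4, 6, 16)
```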