Unified Spherical Frontend: Learning Rotation-Equivariant Representations of Spherical Images from Any Camera
Abstract
Modern perception increasingly relies on fisheye, panoramic, and other wide-FoV cameras, yet most pipelines still process such imagery on 2D grids with planar CNNs designed for pinhole cameras, where image-space neighborhoods misrepresent physical adjacency and models are sensitive to global rotations. Frequency-domain spherical CNNs partially address this mismatch but require costly spherical harmonic transforms that constrain resolution and efficiency. We introduce the Unified Spherical Frontend (USF), a lens-agnostic framework that lifts images from any calibrated camera to a unit-sphere representation via ray-direction correspondences and performs spherical resampling, convolution, and pooling directly in the spatial domain. USF is modular: projection, location sampling, value interpolation, and output-resolution controls are decoupled. Its distance-only spherical kernels provide configurable rotation-equivariance by design (mirroring the translation-equivariance of planar CNNs) while avoiding harmonic transforms. We compare standard planar backbones with their spherical counterparts on classification, detection, and segmentation across synthetic (Spherical MNIST) and real-world (PANDORA, Stanford 2D3DS) datasets, and stress-test robustness to extreme lens distortions, varying FoV, and arbitrary rotations. USF processes high-resolution spherical imagery efficiently, incurs a performance drop of \emph{less than 1\%} under random test-time rotations even without rotational augmentation, and generalizes zero-shot from one lens type to previously unseen wide-FoV lenses with minimal degradation.
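To make the two ideas named above concrete, the following is a minimal sketch (not the authors' implementation) of (1) lifting calibrated pixels to unit-sphere ray directions and (2) a distance-only kernel whose weights depend solely on geodesic (great-circle) distance, which is what makes such kernels rotation-equivariant. The pinhole intrinsics `K`, the Gaussian weight profile, and `sigma` are illustrative assumptions; USF itself supports arbitrary calibrated projection models.

```python
import numpy as np

def pixels_to_sphere(u, v, K):
    """Lift pixel coordinates (u, v) to unit ray directions on the sphere S^2."""
    K_inv = np.linalg.inv(K)
    rays = K_inv @ np.stack([u, v, np.ones_like(u)], axis=0)   # back-project pixels
    return rays / np.linalg.norm(rays, axis=0, keepdims=True)  # normalize to |r| = 1

def distance_only_weights(center, neighbors, sigma=0.05):
    """Kernel weights that depend only on geodesic distance to the center ray.

    Geodesic distance is invariant under any global rotation R, i.e.
    d(Rx, Ry) = d(x, y), so a kernel built from it commutes with rotations.
    """
    cos_angle = np.clip(neighbors @ center, -1.0, 1.0)
    geodesic = np.arccos(cos_angle)                 # great-circle distance, radians
    return np.exp(-0.5 * (geodesic / sigma) ** 2)   # assumed Gaussian profile

# Toy usage: lift a 3x3 pixel patch and weight it around its center ray.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])               # hypothetical intrinsics
uu, vv = np.meshgrid(np.arange(319, 322, dtype=float),
                     np.arange(239, 242, dtype=float))
rays = pixels_to_sphere(uu.ravel(), vv.ravel(), K).T  # (9, 3) unit vectors
w = distance_only_weights(rays[4], rays)              # rays[4] is the patch center
print(w.round(3))
```

Because the weights are a function of geodesic distance alone, rotating the input rays by any rotation matrix leaves `w` unchanged; this mirrors how a planar convolution's weights are a function of image-plane offset alone, yielding translation-equivariance.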