Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement
Abstract
Multimodal image registration is a fundamental task in multimodal imaging and a prerequisite for downstream cross-modal analysis. Despite recent progress with shared feature extraction and multi-scale architectures, two key limitations remain. First, some methods use disentanglement to learn shared features but mainly regularize the shared part, so modality-private cues can still leak into the shared space. Second, most multi-scale frameworks support only one transformation type, which limits their applicability in real-world scenarios where global misalignment and local deformation coexist. To address these issues, we view hybrid multimodal registration as jointly constructing a stable shared feature space and estimating a unified hybrid transformation within that space. Building on this perspective, we introduce HRNet, a Hybrid Registration Network that couples representation disentanglement with hybrid parameter prediction. A shared backbone with Modality-Specific Batch Normalization (MSBN) produces multi-scale features, while a Cross-scale Disentanglement and Adaptive Projection (CDAP) module suppresses modality-private cues across scales and projects the shared component into a stable subspace suited to correspondence estimation. On top of this shared space, a Hybrid Parameter Prediction Module (HPPM) performs non-iterative, coarse-to-fine estimation of both global rigid parameters and multi-scale fine-grained deformation fields, which are fused into a single coherent deformation field. Extensive experiments on four multimodal datasets demonstrate state-of-the-art performance on both rigid and non-rigid registration tasks. Code will be made publicly available.
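To make the MSBN idea concrete, the following is a minimal PyTorch sketch of the usual modality-specific normalization pattern: each modality is routed through its own BatchNorm2d (with its own running statistics and affine parameters) while the surrounding backbone weights stay shared. The class name, argument names, and the two-modality default are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class ModalitySpecificBatchNorm2d(nn.Module):
    """Sketch of MSBN: one BatchNorm2d per modality behind shared weights."""

    def __init__(self, num_features: int, num_modalities: int = 2):
        super().__init__()
        # Separate normalizers so each modality keeps its own running
        # statistics and affine (gamma/beta) parameters.
        self.bns = nn.ModuleList(
            [nn.BatchNorm2d(num_features) for _ in range(num_modalities)]
        )

    def forward(self, x: torch.Tensor, modality: int) -> torch.Tensor:
        # Route the whole (single-modality) batch through its normalizer.
        return self.bns[modality](x)

# Illustrative usage: the same backbone features, normalized per modality.
msbn = ModalitySpecificBatchNorm2d(num_features=64, num_modalities=2)
feat_a = msbn(torch.randn(4, 64, 32, 32), modality=0)  # e.g. optical
feat_b = msbn(torch.randn(4, 64, 32, 32), modality=1)  # e.g. SAR/infrared
```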
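Likewise, the fusion step attributed to HPPM, combining global rigid parameters with a fine-grained deformation field into a single coherent field, can be sketched as composing an affine sampling grid with a dense displacement field. The function and variable names below are hypothetical, and the exact fusion used by HPPM is defined in the paper body; this is only one standard way to realize such a composition.

```python
import torch
import torch.nn.functional as F

def fuse_rigid_and_flow(theta: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Compose a global rigid transform with a local displacement field.

    theta: (B, 2, 3) rigid/affine matrices; flow: (B, 2, H, W) offsets
    in normalized [-1, 1] coordinates. Returns a (B, H, W, 2) grid.
    """
    B, _, H, W = flow.shape
    rigid_grid = F.affine_grid(theta, size=(B, 1, H, W), align_corners=False)
    # Add the fine-grained offsets on top of the global alignment.
    return rigid_grid + flow.permute(0, 2, 3, 1)

# Illustrative usage: warp the moving image with the fused field.
moving = torch.randn(1, 1, 64, 64)
theta = torch.tensor([[[1.0, 0.0, 0.05], [0.0, 1.0, -0.02]]])  # near-identity
flow = 0.01 * torch.randn(1, 2, 64, 64)
warped = F.grid_sample(moving, fuse_rigid_and_flow(theta, flow),
                       align_corners=False)
```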