Poster
MINIMA: Modality Invariant Image Matching
Jiangwei Ren · Xingyu Jiang · Zizhuo Li · Dingkang Liang · Xin Zhou · Xiang Bai
Image matching for both cross-view and cross-modality cases plays a critical role in multi-modal perception. The modality gap caused by different imaging systems and styles makes the matching task highly challenging. Existing works try to extract invariant features for a specific modality and train on limited datasets, which leads to poor generalization. To address this, we present MINIMA, a unified image matching framework for multiple cross-modal cases. Without pursuing fancy modules, MINIMA aims to improve universal performance from the perspective of data scaling-up. For this purpose, we propose a simple yet effective data engine that can freely produce a large dataset containing multiple modalities, rich scenarios, and accurate labels. Specifically, we scale up the modalities from cheap but rich RGB-only matching data by means of generative modules. With this setting, the matching labels and rich diversity of the RGB datasets are well inherited by the generated multi-modal data. Building on this, we construct MD-syn, a new comprehensive dataset that fills the data gap for general multi-modal image matching. With MD-syn, we can directly train any advanced matching pipeline on randomly selected modality pairs to obtain cross-modal ability. Extensive experiments on synthetic and real datasets demonstrate that MINIMA achieves large gains in cross-modal matching and even outperforms methods designed for specific modalities. We also test zero-shot performance on remote sensing and medical tasks, where MINIMA shows the best generalization among SOTA methods. Our dataset and code will be released.
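The abstract describes scaling up an RGB-only matching dataset by translating images into other modalities with generative modules, so that the ground-truth correspondences carry over, and then training on randomly selected modality pairs. The sketch below illustrates that idea only; the modality list, the `translate` stub, and all function names are hypothetical placeholders rather than the authors' implementation, and it assumes the generated image stays pixel-aligned with its RGB source so labels can be reused directly.

```python
import random
import numpy as np

# Hypothetical target modalities; the real data engine would use generative
# modules (image translation / style transfer) for each of these.
MODALITIES = ["infrared", "depth", "event", "normal", "sketch", "paint"]

def translate(rgb_image: np.ndarray, modality: str) -> np.ndarray:
    """Placeholder for a generative translation model.
    Here we only perturb the colors so the sketch stays runnable."""
    rng = np.random.default_rng(abs(hash(modality)) % (2**32))
    gain = rng.uniform(0.5, 1.5, size=(1, 1, 3))
    return np.clip(rgb_image * gain, 0, 255).astype(np.uint8)

def build_multimodal_sample(rgb_a, rgb_b, matches):
    """Scale up one RGB pair: assuming the translation is pixel-aligned,
    the original match labels are inherited by every generated pair."""
    sample = {("rgb", "rgb"): (rgb_a, rgb_b, matches)}
    for m in MODALITIES:
        sample[("rgb", m)] = (rgb_a, translate(rgb_b, m), matches)
    return sample

def sample_training_pair(sample):
    """Randomly pick a modality pair for one training step."""
    key = random.choice(list(sample.keys()))
    return sample[key]

# Toy usage with a fake RGB pair and fake correspondences.
img_a = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
img_b = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
matches = np.random.rand(100, 4)  # (x_a, y_a, x_b, y_b) per correspondence
pair = sample_training_pair(build_multimodal_sample(img_a, img_b, matches))
```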