V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence
Jiancheng Pan ⋅ Runze Wang ⋅ Tianwen Qian ⋅ Mohammad Mahdi ⋅ Yanwei Fu ⋅ Xiangyang Xue ⋅ Xiaomeng Huang ⋅ Luc Van Gool ⋅ Danda Paudel ⋅ Yuqian Fu
Abstract
Cross-view object correspondence, exemplified by the representative task of ego–exo object correspondence, aims to establish consistent associations of the same object across different viewpoints (e.g., ego-centric and exo-centric). This task poses significant challenges due to drastic viewpoint and appearance variations, making it non-trivial to directly apply existing segmentation models such as SAM2. To address this, we present V$^{2}$-SAM, a unified cross-view object correspondence framework that adapts SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators. Specifically, the Cross-View Anchor Prompt Generator (V$^{2}$-Anchor), built upon DINOv3 features, establishes geometry-aware correspondences and, for the first time, unlocks coordinate-based prompting for SAM2 in cross-view scenarios. Meanwhile, the Cross-View Visual Prompt Generator (V$^{2}$-Visual) enhances appearance-guided cues via a novel visual prompt matcher that aligns ego–exo representations from both feature and structural perspectives. To effectively exploit the strengths of both prompts, we further adopt a multi-expert design and introduce a Post-hoc Cyclic Consistency Selector (PCCS) that adaptively selects the most reliable expert based on cyclic consistency. Extensive experiments validate the effectiveness of V$^{2}$-SAM, achieving new state-of-the-art performance on Ego-Exo4D (ego–exo object correspondence), DAVIS-2017 (video object tracking), and HANDAL-X (robotic-ready cross-view correspondence). Code will be released upon acceptance.
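The post-hoc selection idea can be illustrated with a minimal sketch: each expert's prediction in the target view is mapped back to the source view by a second segmentation pass, and the expert whose cycle-reconstructed mask best agrees with the original query is kept. The function names and the IoU-based agreement score below are illustrative assumptions, not the paper's exact PCCS criterion.

```python
import numpy as np


def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union between two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def select_expert(query_mask: np.ndarray, cycle_masks: dict) -> tuple:
    """Pick the expert whose cycle-reconstructed mask best matches the query.

    query_mask:  binary mask of the object in the source (e.g., ego) view.
    cycle_masks: {expert_name: mask mapped target -> source by a second pass}
                 (hypothetical interface; in practice each mask comes from
                 re-prompting the segmenter in the opposite direction).
    """
    scores = {name: iou(query_mask, m) for name, m in cycle_masks.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

A usage sketch: given cycle masks from a coordinate-based expert and a visual-prompt expert, `select_expert(query_mask, {"anchor": m_a, "visual": m_v})` returns the name of the expert with the higher cycle-consistency score along with all scores.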