GeoFree-CoSeg: Unsupervised Point Cloud-Image Cross-Modal Co-Segmentation Without Geometric Alignment
Abstract
Co-segmentation aims to identify and segment common objects across a set of point clouds or images. Existing methods focus on single-modal co-segmentation. However, the limited semantics of a single modality restrict the discovery of common objects, necessitating costly and labor-intensive segmentation masks as supervision. In contrast, cross-modal co-segmentation leverages both modalities and offers two key advantages: (i) additional semantic cues compensate for the absence of segmentation masks; and (ii) complementary modalities provide richer common semantics beyond the limits of any single modality. Motivated by these advantages, we introduce a novel task: unsupervised point cloud-image cross-modal co-segmentation. We tackle it with a coarse-to-fine approach. First, a 3D branch and a 2D branch extract coarse common semantics from their respective modalities. Then, a cross-modal common semantic graph purifies these features into fine-grained common semantics. Finally, the 3D and 2D common semantic features are fused and mutually enhanced, without requiring geometric alignment. Experiments on two standard point cloud benchmarks and two corresponding image co-segmentation datasets show that our method outperforms existing unsupervised state-of-the-art approaches.