VRCLIP: Multimodal Canonical Correlation Alignment for CLIP-Driven Vision-Radio Person Re-Identification
Abstract
Person re-identification (ReID) is critical for public safety, yet the performance of RGB-based methods degrades under challenging lighting and occlusion conditions. In contrast, low-frequency radio frequency (RF) signals, with their superior penetration capability and illumination invariance, provide ideal complementary information. However, fusing these heterogeneous modalities is challenging: conventional approaches rely heavily on cross-modal distribution matching, which often over-regularizes the features and weakens the discriminative capacity within each modality. Rather than enforcing direct distribution alignment, canonical correlation analysis (CCA) constructs a shared subspace that maximizes cross-modal correlation, inherently balancing modality specificity and shared semantics. Inspired by this, we reformulate cross-modal alignment as a correlation maximization problem, which avoids direct constraints on feature distributions and guides the model to harmonize intra-modal discriminative learning with cross-modal alignment. Specifically, VRCLIP first refines CLIP’s visual encoder with illumination-disentangling objectives, then aligns RGB and RF embeddings in a canonical correlation subspace, and finally employs an RF-anchored reliability gate for adaptive fusion. To advance the area, we will release VRR, the first large-scale vision–radio ReID dataset, with over 650K paired image–radar samples and position annotations for 31 participants. Extensive experiments show that VRCLIP achieves state-of-the-art performance of 93.9\% mAP and generalizes robustly across diverse lighting and occlusion conditions.
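As a minimal sketch of the correlation-maximization view adopted here, the standard CCA objective over paired RGB and RF embeddings can be written as below; the symbols $f_v$, $f_r$, $w_v^{(k)}$, $w_r^{(k)}$, the covariance matrices, and the loss name $\mathcal{L}_{\mathrm{align}}$ are illustrative notation and not necessarily the paper's exact formulation.
\[
\rho_k \;=\; \max_{w_v^{(k)},\, w_r^{(k)}}\;
\frac{{w_v^{(k)}}^{\!\top} \Sigma_{vr}\, w_r^{(k)}}
{\sqrt{{w_v^{(k)}}^{\!\top} \Sigma_{vv}\, w_v^{(k)}}\;
\sqrt{{w_r^{(k)}}^{\!\top} \Sigma_{rr}\, w_r^{(k)}}},
\qquad
\mathcal{L}_{\mathrm{align}} \;=\; -\sum_{k=1}^{K} \rho_k,
\]
where $\Sigma_{vv}$ and $\Sigma_{rr}$ are the within-modality covariances of the RGB and RF embeddings $f_v$ and $f_r$, $\Sigma_{vr}$ is their cross-covariance, and the $K$ canonical directions span the shared subspace. Maximizing the summed correlations aligns the two modalities without directly constraining their marginal feature distributions, which is the property the abstract contrasts against distribution-matching objectives.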