R$^2$TUA: Reconstruction-residual Based Targeted and Untargeted Attack Against Text-Image Person Re-Identification
Yubo Wang ⋅ Yan Lu ⋅ Bin Liu ⋅ Xulin Li ⋅ Jixiang Niu
Abstract
Text-Image Person Re-Identification (TI-ReID) models are widely deployed in intelligent surveillance. Built on deep neural networks and vision–language models, TI-ReID models inherit their vulnerability to adversarial attacks, posing potential security risks. Yet their security issues have received far less attention than retrieval accuracy, and the robustness of TI-ReID to adversarial attacks remains largely unexplored. To fill this gap, we propose the Reconstruction-residual based Targeted and Untargeted Attack (R$^2$TUA), which takes an image and an adversarial text prompt as input and generates perturbations that make TI-ReID models incorrectly match the perturbed image to the identity described by the adversarial prompt. To precisely inject identity attributes into the perturbations and achieve fine-grained targeted attacks, R$^2$TUA introduces Transformer-based Gradual Multimodal Fusion (TGMF), which fuses the image and the adversarial prompt progressively across layers with a tunable cross-modal weight. In addition, we propose a fully differentiable Soft Clamp Function (SCF), which keeps perturbations inconspicuous while avoiding the local gradient-vanishing effects that would trap training in suboptimal local minima. To further align perturbed images with their adversarial text descriptions while driving them to mismatch their original descriptions, R$^2$TUA employs Push-Pull Losses (PPLs) and matching losses during training. Extensive evaluations across multiple datasets and models demonstrate the superior untargeted and targeted attack performance of R$^2$TUA. It also exhibits strong adaptability and transferability against black-box models, outperforming all related attacks across multiple tasks.