Prompt-Anchored Vision–Text Distillation for Lifelong Person Re-identification
Abstract
Lifelong person re-identification (LReID) aims to train a generalizable model from sequentially collected data. However, such models often suffer from semantic drift, limited adaptability, and catastrophic forgetting as new domains emerge. Existing exemplar-free approaches focus mainly on visual-encoder distillation or parameter regularization, overlooking the potential of auxiliary modalities, such as text, to preserve semantic stability and enable incremental plasticity. We observe that the frozen text encoder of a pretrained vision–language model can serve as a stable semantic anchor, offering consistent guidance throughout lifelong learning. To exploit this synergy between vision and text, we propose Prompt-Anchored Vision–Text Distillation (PAD), a unified framework that strengthens semantic alignment and cross-domain generalization. On the textual side, we distill semantic prompts that maintain vision–text alignment within a fixed semantic coordinate system. On the visual side, an EMA-based teacher performs model distillation, assisted by an adaptive prompt pool that allocates new prompt slots for each incoming domain while freezing those of past domains, achieving both adaptability and memory retention. Extensive experiments show that PAD substantially outperforms state-of-the-art methods across multiple LReID benchmarks.
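To make the two visual-side mechanisms concrete, below is a minimal PyTorch sketch of an adaptive prompt pool that freezes past-domain slots while allocating fresh trainable ones, together with an EMA teacher update. All names (AdaptivePromptPool, ema_update, slots_per_domain) are hypothetical illustrations, not the paper's implementation.

```python
import copy
import torch
import torch.nn as nn

class AdaptivePromptPool(nn.Module):
    """Hypothetical sketch: one block of learnable prompt slots per domain.

    When a new domain arrives, all previously learned blocks are frozen
    (memory retention) and a fresh trainable block is allocated (plasticity).
    """

    def __init__(self, embed_dim: int, slots_per_domain: int = 4):
        super().__init__()
        self.embed_dim = embed_dim
        self.slots_per_domain = slots_per_domain
        self.pools = nn.ParameterList()  # one parameter block per domain

    def add_domain(self) -> None:
        # Freeze prompt slots learned on all earlier domains.
        for p in self.pools:
            p.requires_grad_(False)
        # Allocate fresh trainable slots for the incoming domain.
        new_slots = nn.Parameter(
            torch.randn(self.slots_per_domain, self.embed_dim) * 0.02
        )
        self.pools.append(new_slots)

    def forward(self) -> torch.Tensor:
        # Concatenate all domain blocks; only the newest receives gradients.
        return torch.cat(list(self.pools), dim=0)

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, momentum: float = 0.999):
    """EMA-based teacher: exponential moving average of the student's weights."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

if __name__ == "__main__":
    pool = AdaptivePromptPool(embed_dim=512)
    pool.add_domain()   # domain 1: fresh trainable slots
    pool.add_domain()   # domain 2: domain-1 slots are now frozen
    prompts = pool()    # shape (8, 512); only the last 4 rows are trainable
    print(prompts.shape, [p.requires_grad for p in pool.pools])

    # Toy EMA teacher, initialized as a copy of the student.
    student = nn.Linear(512, 512)
    teacher = copy.deepcopy(student)
    ema_update(teacher, student, momentum=0.999)
```

Under these assumptions, the frozen blocks act as the retained memory of past domains, while the EMA teacher provides a slowly moving target for the visual distillation loss.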