Unlocking Motion from Large Vision Models with a Semantic and Kinematic Duality for Gait Recognition
Abstract
Existing set-based gait recognition methods achieve remarkable performance by capturing global semantic context. However, their order-invariant nature prevents them from modeling the fine-grained kinematic patterns that unfold over time. To unify global and process-level representations, we propose GaitMax, a framework that captures both semantic context and kinematic motion. GaitMax leverages attention-based spatiotemporal modeling to dynamically represent detailed part-level trajectories. While this detailed representation is more expressive, it also captures more nuisance factors (e.g., clothing, viewpoint), which can lead to shortcut learning. To mitigate this, we introduce CDLoss, a Conditional Decorrelation Loss that explicitly disentangles gait embeddings from nuisance factors using vision-language supervision. Because this loss requires high-quality nuisance descriptions, we construct GCaption, a new resource that provides natural-language annotations for multiple gait datasets, moving beyond simple categorical labels. GCaption not only enables CDLoss but also serves as a foundation for future context-aware gait analysis. Extensive experiments on multiple large-scale gait benchmarks validate the superiority of GaitMax. Models, code, and resources will be released upon publication.
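To make the idea of conditional decorrelation concrete, the following is a minimal illustrative sketch of one way such a penalty could be implemented: it drives the cross-covariance between gait embeddings and language-encoded nuisance embeddings toward zero. This is an assumption-laden example, not the paper's actual CDLoss; the function name, shapes, and the choice of a cross-covariance penalty are hypothetical.

```python
# Hypothetical sketch of a cross-covariance decorrelation penalty.
# NOT the paper's CDLoss definition; names, shapes, and the specific
# penalty form are assumptions for illustration only.
import torch


def decorrelation_penalty(gait_emb: torch.Tensor,
                          nuisance_emb: torch.Tensor,
                          eps: float = 1e-5) -> torch.Tensor:
    """Penalize linear dependence between gait embeddings and nuisance
    (text) embeddings by pushing their cross-covariance toward zero.

    gait_emb:     (B, D_g) embeddings from the gait backbone.
    nuisance_emb: (B, D_t) embeddings of nuisance descriptions
                  (e.g., from a frozen vision-language text encoder).
    """
    # Standardize each feature dimension across the batch.
    z = (gait_emb - gait_emb.mean(0)) / (gait_emb.std(0) + eps)
    t = (nuisance_emb - nuisance_emb.mean(0)) / (nuisance_emb.std(0) + eps)

    # Empirical cross-covariance matrix, shape (D_g, D_t).
    cross_cov = z.T @ t / (z.shape[0] - 1)

    # Squared Frobenius-norm penalty: zero when the gait and nuisance
    # embeddings are (linearly) decorrelated within the batch.
    return cross_cov.pow(2).mean()


if __name__ == "__main__":
    gait = torch.randn(32, 256)      # toy gait embeddings
    nuisance = torch.randn(32, 512)  # toy nuisance-text embeddings
    print(decorrelation_penalty(gait, nuisance))
```

In a sketch like this, the penalty would be added to the recognition objective with a weighting coefficient, so that identity-discriminative structure is preserved while linear correlation with the nuisance descriptions is suppressed.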