Seeing Motion Through Polarity for Event-based Action Recognition
Abstract
Event-based Action Recognition (EAR) provides a promising pathway for understanding dynamic behaviors under challenging conditions. Recent progress in vision-language models has introduced a cross-modal learning paradigm into EAR, enabling models to associate event streams with textual semantics to enhance conceptual understanding. However, existing methods typically overlook the intrinsic polarity-driven motion cues that are fundamental to event data, leading to suboptimal spatiotemporal representations. To address this limitation, we propose a POlarity Knowledge Enhanced framework (POKER), which explicitly incorporates polarity-aware motion knowledge across the visual and textual modalities. POKER consists of two synergistic components: a Polarity Motion Capturer (PMC) and a Polarity Motion Reasoner (PMR). Specifically, PMC decouples positive and negative polarities to capture polarity-sensitive motion cues, while PMR semantically analyzes polarity-induced motion dynamics via large language models. Through polarity alignment, POKER couples semantic reasoning with visual dynamics, yielding more discriminative representations. Extensive experiments on multiple benchmarks demonstrate that POKER enhances performance across diverse event representations.
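To make the polarity-decoupling idea concrete, the following is a minimal sketch (not the paper's implementation) of how an event stream, stored as (x, y, t, p) tuples with p ∈ {+1, −1}, can be split into positive and negative polarity streams before any downstream motion encoding; the function name and array layout are illustrative assumptions.

```python
import numpy as np

def decouple_polarity(events: np.ndarray):
    """Split an event stream into positive and negative polarity streams.

    events: (N, 4) array of (x, y, t, p), where p is +1 (brightness
    increase) or -1 (brightness decrease).
    Returns a (pos_events, neg_events) pair, each a subset of the rows.
    """
    p = events[:, 3]
    return events[p > 0], events[p < 0]

# Toy stream of four events, alternating polarity.
events = np.array([
    [10.0, 20.0, 0.01, +1.0],
    [11.0, 20.0, 0.02, -1.0],
    [12.0, 21.0, 0.03, +1.0],
    [13.0, 21.0, 0.04, -1.0],
])
pos, neg = decouple_polarity(events)
# Each polarity stream can now be encoded separately to expose
# polarity-sensitive motion cues, as PMC is described as doing.
```

In practice the two streams would each be rasterized (e.g. into frames or voxel grids) and fed to the visual encoder, so that motion patterns specific to brightness increases and decreases are not averaged away.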