SMV-EAR: Bringing Spatiotemporal Multi-View Representation Learning into Efficient Event-Based Action Recognition
Rui Fan ⋅ Weidong Hao ⋅ Juntao Guan ⋅ Lai Rui ⋅ Tong Wu ⋅ Fanhong Zeng ⋅ Lin Gu
Abstract
Event camera-based action recognition (EAR) offers compelling privacy-protecting and efficiency advantages, and temporal motion dynamics are of central importance to it. Existing spatiotemporal multi-view representation learning (SMVRL) methods for event-based object recognition (EOR) offer promising solutions by projecting $H$-$W$-$T$ events along the spatial axes $H$ and $W$, yet they are limited by their translation-variant spatial binning representation and naive early-concatenation fusion architecture. This paper reexamines the key SMVRL design stages for EAR and proposes: (i) a principled spatiotemporal multi-view representation based on translation-invariant dense conversion of sparse events, (ii) a dual-branch dynamic fusion architecture that models sample-wise complementarity between motion features from different views, and (iii) a bio-inspired temporal warping augmentation that mimics the speed variability of real-world human actions. On three challenging EAR datasets, HARDVS, DailyDVS-200, and THU-EACT-50-CHL, we achieve +7.0\%, +10.7\%, and +10.2\% Top-1 accuracy gains over the existing SMVRL EOR method, with 30.1\% fewer parameters and 35.7\% lower computational cost, establishing our framework as a novel and powerful EAR paradigm. Code will be released upon acceptance.
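To make the pipeline concrete, the following is a minimal NumPy sketch of two ingredients the abstract names: projecting $H$-$W$-$T$ events along the spatial axes to obtain $T$-$W$ and $T$-$H$ views, and a monotonic temporal warp that mimics action-speed variability. The event layout (an $(N, 4)$ array of $(x, y, t, p)$ rows with timestamps normalized to $[0, 1]$), the histogram-based projection, and the sinusoidal warp shape are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def multi_view_projection(events, num_bins=64, height=128, width=128):
    """Project an (N, 4) event array of (x, y, t, p) rows, t in [0, 1],
    onto two spatiotemporal views: a T-W plane (collapsing the H axis)
    and a T-H plane (collapsing the W axis), as event-count histograms.
    Illustrative binning only; the paper argues for a translation-
    invariant dense conversion instead of plain spatial binning."""
    x, y, t = events[:, 0], events[:, 1], events[:, 2]
    t_edges = np.linspace(0.0, 1.0, num_bins + 1)
    view_tw, _, _ = np.histogram2d(t, x, bins=[t_edges, np.arange(width + 1)])
    view_th, _, _ = np.histogram2d(t, y, bins=[t_edges, np.arange(height + 1)])
    return view_tw, view_th

def temporal_warp(events, strength=0.3, rng=None):
    """Hypothetical temporal warping augmentation: a smooth, monotonic
    nonlinear rescaling of timestamps that mimics speed variability of
    real-world actions. strength < 1 guarantees monotonicity, since
    d(warped)/dt = 1 + strength * cos(...) > 0."""
    rng = np.random.default_rng() if rng is None else rng
    out = events.copy()
    t = out[:, 2]
    phase = rng.uniform(0.0, 2.0 * np.pi)
    warped = t + (strength / (2.0 * np.pi)) * np.sin(2.0 * np.pi * t + phase)
    # Renormalize to [0, 1] so downstream temporal binning is unaffected.
    warped -= warped.min()
    warped /= warped.max() + 1e-9
    out[:, 2] = warped
    return out
```

A typical use would apply `temporal_warp` to the raw event stream during training and then feed both projected views to the two fusion branches; the exact dense-conversion and dynamic-fusion details are left to the method section.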