Poster
Action Detail Matters: Refining Video Recognition with Local Action Queries
Mengmeng Wang · Zeyi Huang · Xiangjie Kong · Guojiang Shen · Guang Dai · Jingdong Wang · Yong Liu
Video action recognition involves interpreting both global context and specific details to accurately identify actions. While previous models are effective at capturing spatiotemporal features, they often lack a focused representation of key action details. To address this, we introduce FocusVideo, a framework designed for refining video action recognition through integrated global and local feature learning. Inspired by human visual cognition theory, our approach balances the focus on both broad contextual changes and action-specific details, minimizing the influence of irrelevant background noise. We first employ learnable action queries to selectively emphasize action-relevant regions without requiring region-specific labels. Next, these queries are learned by a local action streaming branch that enables progressive query propagation. Moreover, we introduce a parameter-free feature interaction mechanism for effective multi-scale interaction between global and local features with minimal additional overhead. Extensive experiments demonstrate that FocusVideo achieves state-of-the-art performance across multiple action recognition datasets, validating its effectiveness and robustness in handling action-relevant details.
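The abstract does not specify how the learnable action queries attend to action-relevant regions. As a rough illustration only, the sketch below assumes a standard scaled dot-product cross-attention, in which each query scores every video token and pools them by the resulting weights. The function names (`softmax`, `query_attend`), the toy token values, and the attention form itself are assumptions made for illustration. They are not taken from FocusVideo.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def query_attend(queries, tokens):
    """Hypothetical helper: each query attends over all tokens and
    returns an attention-weighted average of the token features."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)  # standard scaled dot-product factor
    pooled_all, weights_all = [], []
    for q in queries:
        # Similarity between this query and every token.
        scores = [scale * sum(qi * ti for qi, ti in zip(q, t)) for t in tokens]
        weights = softmax(scores)
        # Weighted sum of token features, one value per feature dimension.
        pooled = [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        pooled_all.append(pooled)
        weights_all.append(weights)
    return pooled_all, weights_all

# Toy input (assumed): three 2-D tokens, one query aligned with the first token.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
queries = [[4.0, 0.0]]
local_feats, attn = query_attend(queries, tokens)
```

In this toy case the query places most of its weight on the first token, which loosely mirrors the idea of emphasizing action-relevant regions without region-level labels. In a trained model the query vectors would be learned parameters rather than fixed values.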