Prototypical Action Reasoning Facilitated by Vision-Language Alignment for Egocentric Action Anticipation
Abstract
Egocentric Action Anticipation aims to infer future actions from observed video, a capability that is crucial for embodied AI systems. However, its advancement is hindered by the inherent stochasticity of the future, which introduces significant prediction uncertainty. Prevailing methods typically adopt an end-to-end approach to model holistic spatiotemporal contexts, yet they often lack explicit semantic reasoning capabilities, making it difficult to handle open-ended future uncertainty. To address these challenges, we propose a Prototypical Action Reasoning Framework Facilitated by Vision-Language Alignment (PAR-VLA), which leverages the semantic alignment capability of vision-language models to learn disentangled visual prototypes for verbs and nouns. These prototypes serve as robust semantic anchors, transforming the unconstrained temporal prediction problem into a conditional forecasting task guided by well-defined semantic concepts. Our multi-stage framework first extracts visually grounded, text-aligned prototype groups from a VLM, learning multiple prototypes per category to capture intra-class diversity. Subsequently, a novel Prototypical Action Reasoning-guided Verb-Noun Encoding branch dynamically retrieves the most relevant verb and noun concepts based on visual observations and explicitly models their interactions to guide temporal anticipation. Furthermore, we introduce Dual-Stream Symbiotic Predictive Decoders to capture the interdependencies between verbs and nouns more precisely during prediction. Experimental results demonstrate that PAR-VLA achieves state-of-the-art performance and exhibits a strong capability in dealing with future uncertainty.
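To make the prototype-retrieval idea in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation: it assumes learnable per-category prototype groups for verbs and nouns and retrieves the most similar prototypes for a pooled visual observation feature via cosine similarity. All names, dimensions, and vocabulary sizes (e.g., feat_dim, protos_per_class, 97 verbs / 300 nouns) are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

feat_dim, protos_per_class = 512, 4
num_verbs, num_nouns = 97, 300  # assumed EPIC-KITCHENS-style vocabulary sizes

# Prototype groups: multiple prototypes per category to capture intra-class diversity.
verb_protos = torch.nn.Parameter(torch.randn(num_verbs, protos_per_class, feat_dim))
noun_protos = torch.nn.Parameter(torch.randn(num_nouns, protos_per_class, feat_dim))

def retrieve(protos: torch.Tensor, visual_feat: torch.Tensor, top_k: int = 5):
    """Retrieve the prototypes most relevant to a batch of visual features.

    protos:      (C, P, D) prototype group per category
    visual_feat: (B, D)    pooled observation feature (e.g., from a VLM visual encoder)
    returns:     (B, top_k, D) retrieved prototypes and (B, top_k) similarity scores
    """
    flat = F.normalize(protos.reshape(-1, protos.shape[-1]), dim=-1)  # (C*P, D)
    feat = F.normalize(visual_feat, dim=-1)                           # (B, D)
    sim = feat @ flat.t()                                             # cosine similarities
    scores, idx = sim.topk(top_k, dim=-1)
    return flat[idx], scores

# Example: condition a temporal anticipation head on the retrieved verb/noun concepts.
obs = torch.randn(2, feat_dim)                     # placeholder observed-clip features
verb_ctx, _ = retrieve(verb_protos, obs)
noun_ctx, _ = retrieve(noun_protos, obs)
context = torch.cat([verb_ctx.mean(1), noun_ctx.mean(1)], dim=-1)
```

In this sketch, `context` would be passed to the anticipation decoder so that forecasting is conditioned on retrieved semantic concepts rather than on unconstrained temporal extrapolation; how the paper actually fuses verb and noun streams is described in its method section, not here.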