Progressive Cross-Modal Causal Intervention for Long-Term Action Recognition
Abstract
Intricate correlations among atomic actions and inherent visual confounders make long-term action recognition (LTAR) a persistently challenging task. Methods based on vision-language models that use label text for supervision offer a way to handle visual confounders, but their reliance on statistical correlations rather than causal mechanisms introduces two vulnerabilities: (1) spurious alignments with non-causal, co-occurring visual features during cross-modal interaction, and (2) misinterpretation of codependencies among actions. To address these limitations, this paper introduces Progressive Cross-Modal Causal Intervention (PCMCI). PCMCI first mitigates co-occurrence hallucination via causal intervention grounded in optimal transport theory. Next, an action relation-aware mechanism blocks the backdoor path induced by the codependency illusion, yielding deconfounded text embeddings. Finally, these deconfounded embeddings serve as the mediator in a front-door adjustment that removes visual confounders. This progressive causal intervention framework facilitates learning robust representations for LTAR. Experiments on three long-term action benchmarks demonstrate the effectiveness of the proposed model.
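For reference, the front-door adjustment mentioned above takes the following standard form. The symbols are assumptions introduced here for illustration and are not notation from the paper: $X$ denotes the visual features, $M$ the deconfounded text embeddings acting as the mediator, and $Y$ the action label. The visual confounders are assumed to be unobserved, to influence both $X$ and $Y$, and to have no direct effect on $M$, while $M$ is assumed to intercept every directed path from $X$ to $Y$.

```latex
P\bigl(Y \mid \mathrm{do}(X = x)\bigr)
  = \sum_{m} P(m \mid x) \sum_{x'} P(Y \mid x', m)\, P(x')
```

The outer sum weights each mediator value by how likely it is given the observed features, and the inner sum averages the outcome over all feature values $x'$, so the confounder itself never has to be measured. How PCMCI estimates or approximates these terms is not stated in the abstract.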