Poster
Cross-modal Causal Relation Alignment for Video Question Grounding
Weixing Chen · Yang Liu · Binglin Chen · Jiandong Su · Yongsen Zheng · Liang Lin
Video question grounding (VideoQG) requires models to answer questions and simultaneously infer the relevant video segments that support the answers. However, existing VideoQG methods typically suffer from spurious cross-modal correlations, failing to identify the dominant visual scenes that align with the intended question. Moreover, although large models possess extensive prior knowledge and demonstrate strong zero-shot performance, issues such as spurious correlations persist, making their application to specific downstream tasks challenging. In this work, we propose a novel causality-aware VideoQG framework named Cross-modal Causal Relation Alignment (CRA), which eliminates spurious correlations and improves the causal consistency between question answering and video temporal grounding. CRA comprises three essential components: i) a Gaussian Smoothing Attention Grounding (GSAG) module that estimates the grounded time interval via cross-modal attention denoised by an adaptive Gaussian filter; ii) a Cross-modal Alignment (CA) module that enhances weakly supervised VideoQG through bidirectional contrastive learning between the estimated video segments and QA features; and iii) an Explicit Causal Intervention (ECI) module for multimodal deconfounding, which applies front-door intervention to vision and back-door intervention to language. Extensive experiments on two VideoQG datasets demonstrate the superiority of CRA in discovering visually grounded content and achieving robust question reasoning. Code will be made available.
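To make two of the described mechanisms concrete, below is a minimal PyTorch sketch, not the authors' implementation: it illustrates (1) smoothing a frame-level cross-modal attention curve with a 1-D Gaussian filter before thresholding it into a grounded time interval, and (2) a bidirectional InfoNCE-style contrastive loss between pooled video-segment and QA features. All shapes and hyperparameters (sigma, kernel_size, threshold, temperature) are illustrative assumptions, not values from the paper.

# Minimal sketch under assumed shapes; not the CRA reference code.
import torch
import torch.nn.functional as F


def gaussian_kernel(kernel_size: int, sigma: float) -> torch.Tensor:
    # 1-D Gaussian kernel used to denoise a per-frame attention curve.
    x = torch.arange(kernel_size, dtype=torch.float32) - (kernel_size - 1) / 2
    k = torch.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()


def smooth_and_ground(attn: torch.Tensor, sigma: float = 2.0,
                      kernel_size: int = 9, threshold: float = 0.5):
    # attn: (B, T) frame-level cross-modal attention scores.
    # Returns the smoothed curve plus a (start, end) frame index per sample.
    kernel = gaussian_kernel(kernel_size, sigma).view(1, 1, -1)
    smoothed = F.conv1d(attn.unsqueeze(1), kernel, padding=kernel_size // 2).squeeze(1)
    # Normalize to [0, 1] per sample, then keep frames above a relative threshold.
    lo = smoothed.amin(dim=1, keepdim=True)
    hi = smoothed.amax(dim=1, keepdim=True)
    smoothed = (smoothed - lo) / (hi - lo + 1e-6)
    mask = smoothed >= threshold
    starts, ends = [], []
    for m, s in zip(mask, smoothed):
        idx = m.nonzero(as_tuple=True)[0]
        if idx.numel() == 0:              # fall back to the attention peak
            idx = s.argmax().unsqueeze(0)
        starts.append(idx.min().item())
        ends.append(idx.max().item())
    return smoothed, starts, ends


def bidirectional_contrastive_loss(video_feat: torch.Tensor, qa_feat: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    # Symmetric InfoNCE between pooled segment features and QA features, both (B, D).
    v = F.normalize(video_feat, dim=-1)
    q = F.normalize(qa_feat, dim=-1)
    logits = v @ q.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    torch.manual_seed(0)
    attn = torch.rand(2, 32)              # batch of 2 videos, 32 frames each
    _, starts, ends = smooth_and_ground(attn)
    print("grounded intervals:", list(zip(starts, ends)))
    loss = bidirectional_contrastive_loss(torch.randn(2, 256), torch.randn(2, 256))
    print("contrastive loss:", loss.item())

In this sketch the contrastive loss treats matched video-QA pairs in a batch as positives and all other pairings as negatives, which is one common way to realize the bidirectional alignment the abstract describes; the causal intervention module is not sketched here.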