

Contextual Augmented Global Contrast for Multimodal Intent Recognition

Kaili Sun · Zhiwen Xie · Mang Ye · Huyin Zhang

Arch 4A-E Poster #271
Fri 21 Jun 5 p.m. PDT — 6:30 p.m. PDT


Multimodal intent recognition (MIR) aims to perceive human intent polarity via language, visual, and acoustic modalities. The inherent ambiguity of intent makes it challenging to recognize in multimodal scenarios. Existing MIR methods tend to model individual videos independently, ignoring the contextual information across videos. This learning manner inevitably introduces perception biases, which are exacerbated by inconsistencies in multimodal information and amplify uncertainty in intent understanding. This challenge motivates us to explore effective global context modeling. Thus, we propose a context-augmented global contrast (CAGC) method to capture rich global context features by mining both intra- and cross-video context interactions for MIR. Concretely, we design a context-augmented transformer module to extract global context dependencies across videos. To further alleviate error accumulation and interference, we develop a cross-video bank that retrieves effective video sources by considering both intentional tendency and video similarity. Furthermore, we introduce a global context-guided contrastive learning scheme, designed to mitigate inconsistencies arising from global context representations and individual modalities lying in different feature spaces. This scheme incorporates global cues as supervision, ensuring the effectiveness of global contextual information while also enhancing consistency learning. Experiments demonstrate that CAGC outperforms state-of-the-art MIR methods. We also generalize our approach to a closely related task, multimodal sentiment analysis, achieving comparable performance.
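As a rough illustration of the two mechanisms named in the abstract, the sketch below retrieves context videos from a bank and computes a contrastive loss. It is not the authors' implementation. The abstract does not say how intentional tendency and video similarity are combined, so the sketch assumes a hard filter on a predicted intent label followed by cosine-similarity ranking, and it uses a standard InfoNCE loss as a stand-in for the global context-guided contrastive objective. All names (`retrieve_from_bank`, `info_nce`, the bank layout, `temperature`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_from_bank(query_feat, query_intent, bank, k=2):
    """Return ids of the k bank videos most similar to the query.

    Assumption: `bank` is a list of (video_id, feature, predicted_intent)
    tuples, and "intentional tendency" is modeled as an exact match on
    the predicted intent label.
    """
    scored = [
        (cosine(query_feat, feat), video_id)
        for video_id, feat, intent in bank
        if intent == query_intent
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [video_id for _, video_id in scored[:k]]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: low when the anchor is closer to the positive
    than to any of the negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy bank: two-dimensional features, intent labels 0 and 1.
bank = [
    ("a", [1.0, 0.0], 0),
    ("b", [0.9, 0.1], 0),
    ("c", [1.0, 0.0], 1),  # similar feature but different intent
    ("d", [0.0, 1.0], 0),
]
retrieved = retrieve_from_bank([1.0, 0.0], 0, bank, k=2)

# Here the anchor stands in for a global context feature and the
# positive for a modality feature of the same video.
aligned_loss = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned_loss = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In this toy setup, video "c" is excluded despite its identical feature because its intent label differs, and the loss is smaller when the global context feature and the modality feature point the same way.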
