Beyond Missing Modalities: Hypergraph Conditioned Diffusion for Uncertainty-Aware Multimodal Emotion Recognition
Abstract
Multimodal Emotion Recognition in Conversations (MERC) aims to understand the emotion expressed in each utterance by effectively integrating audio, text, and visual modalities. However, in real-world scenarios, modalities are often unavoidably missing, which degrades multimodal interpretation performance. To address this, we propose \textbf{Hypergraph Diffusion and Evidence Fusion based Emotion Recognition (HyperEF)}, a novel framework designed to mitigate the challenges arising from incomplete modalities in MERC. Specifically, to counter the performance degradation caused by modality absence, we propose a Masked Hypergraph Attention (MHGAT) conditioned diffusion model that recovers the features of missing modalities in the latent space. To ensure semantic consistency between recovered and available modalities within the same utterance, MHGAT captures high-order semantic information from the available modalities to guide the diffusion model’s denoising process. Furthermore, to disentangle and model the complex uncertainties inherent in MERC, we propose Dual Channel Evidence Fusion (DCEF), which estimates uncertainty at both the feature-source level and the discriminative level, thereby achieving adaptive evidence fusion. Extensive comparative experiments and interpretability analyses demonstrate the superior emotion recognition performance of our model, as well as the contribution of each of its modules.