CICA: Coupling Confidence-Aware Pretraining with Confidence-Informed Attention for Robust Multimodal Sentiment Analysis
Abstract
Multimodal sentiment analysis requires integrating language, visual, and acoustic cues, yet these modalities are often noisy, incomplete, or contradictory, making fusion unreliable. Most existing methods assume uniformly trustworthy modalities and thus degrade when signals conflict. To address this, we propose CICA, a framework that couples Confidence-Aware Pretraining with Confidence-Informed Attention. During pretraining, each modality encoder learns to estimate the reliability of its own representation, producing both embeddings and confidence scores. These scores then guide a confidence-informed attention mechanism, which strengthens contributions from reliable modalities while suppressing noisy or conflicting ones, enabling adaptive fusion under varying signal conditions. CICA achieves state-of-the-art performance across four major benchmarks: MOSI, MOSEI, CH-SIMS, and CH-SIMSv2. It achieves MAE 0.630 and Corr 0.855 on MOSI, and MAE 0.489 and Corr 0.856 on MOSEI, significantly surpassing prior methods. Consistent improvements are also observed on the Acc-7, Acc-2, and F1 metrics. Under noisy and missing-modality conditions, CICA maintains markedly more stable performance, indicating improved robustness and interpretability.
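The abstract does not specify how the confidence scores enter the attention computation; a minimal sketch of one plausible realization is shown below, assuming each modality contributes an embedding plus a scalar confidence, and that attention logits are biased by the log of that confidence so low-confidence modalities receive less weight. All function and variable names here are illustrative, not the authors' implementation.

```python
import numpy as np

def confidence_informed_fusion(embeddings, scores, confidences):
    """Fuse modality embeddings with attention biased by per-modality confidence.

    embeddings:  (M, D) array, one embedding per modality
    scores:      (M,)   raw attention logits (e.g. from a learned scorer)
    confidences: (M,)   reliability estimates in (0, 1] from pretraining
    """
    # Bias each logit by log-confidence: a low-confidence modality is
    # exponentially down-weighted relative to a reliable one.
    logits = scores + np.log(confidences)
    logits -= logits.max()                      # numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()
    fused = weights @ embeddings                # convex combination
    return fused, weights

# Toy example: language, visual, and acoustic embeddings (D = 4),
# with the acoustic stream judged unreliable by its encoder.
emb = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]])
raw_scores = np.zeros(3)                        # equal raw attention
conf = np.array([0.9, 0.9, 0.1])
fused, w = confidence_informed_fusion(emb, raw_scores, conf)
```

With equal raw scores, the fusion weights follow the confidences directly, so the acoustic modality's contribution is suppressed rather than averaged in at full strength.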