Enhance-then-Balance Modality Collaboration for Robust Multimodal Sentiment Analysis
Abstract
Multimodal sentiment analysis (MSA) seeks to infer human emotions by integrating heterogeneous signals from the text, audio, and visual modalities. Although recent approaches attempt to leverage cross-modal complementarity, they often struggle to fully exploit weaker modalities. In practice, the expressive power of the modalities is inherently imbalanced: dominant modalities tend to overshadow non-verbal ones, which not only limits their contribution but also induces modality competition during training. This imbalance degrades fusion performance and robustness under noisy or missing modalities. To address these challenges, we propose the Enhance-then-Balance Modality Collaboration framework (EBMC). EBMC first improves representational quality via Modality Semantic Disentanglement (MSD) and Cross-modal Complementary Enhancement (CCE), which strengthen weaker modalities with information from the other modalities. To prevent dominant modalities from overwhelming the others during joint optimization, EBMC introduces an Energy-guided Modality Coordination (EMC) mechanism that models modality contributions as energy potentials and achieves implicit gradient rebalancing through a differentiable equilibrium objective. Furthermore, an Instance-aware Modality Trust Distillation (IMTD) module estimates sample-level modality reliability and adaptively modulates fusion weights, ensuring robustness to noise and modality incompleteness. Extensive experiments on multiple MSA benchmarks show that EBMC achieves state-of-the-art or competitive results. Moreover, EBMC maintains strong performance under missing-modality settings, highlighting its effectiveness and robustness.