Factorize, Reconstruct, Enhance: A Unified Framework for Multimodal Sentiment Analysis
Abstract
Multimodal Sentiment Analysis (MSA) aims to interpret human emotions comprehensively and robustly by integrating information from the verbal, visual, and acoustic modalities. However, existing models are often hampered by two key challenges: insufficient extraction of the multi-layered semantics carried by each modality, and static feature fusion that cannot adapt to individual samples. To address these issues, this paper proposes a Multi-factor Factor-Decoupling and Semantics-enhanced Fusion Framework for accurate multimodal sentiment analysis. First, each modality is decomposed into three orthogonal subspaces by a multidimensional information separation mechanism, which is regulated by a contrastive constraint that keeps the subspaces separated, an information gain constraint that maximizes the capture of task-relevant features, and a pairwise constraint that ensures the subspaces remain complementary. Subsequently, a variational purification strategy is introduced to further preserve the semantic integrity of each sentiment representation. Finally, the fusion module computes adaptive fusion weights in parallel from multiple orthogonal factors: sample-level modality saliency, global subspace-type importance, and feature-level internal attention. Extensive experiments on three datasets demonstrate the effectiveness of the proposed method.
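To make the multi-factor fusion step concrete, the following is a minimal sketch of how adaptive fusion weights could be computed from the three factors named above (sample-level modality saliency, global subspace-type importance, and feature-level internal attention), assuming a PyTorch implementation. The module name, tensor shapes, and the specific way the factors are combined (elementwise product followed by renormalization, with a sigmoid feature gate) are illustrative assumptions, not the authors' actual design.

```python
# Hypothetical sketch of multi-factor adaptive fusion; shapes and the
# combination rule are assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiFactorFusion(nn.Module):
    def __init__(self, dim: int, num_modalities: int = 3, num_subspaces: int = 3):
        super().__init__()
        # Factor 1: sample-level modality saliency, scored from each modality summary.
        self.saliency = nn.Linear(dim, 1)
        # Factor 2: global subspace-type importance, one learnable scalar per subspace type.
        self.subspace_importance = nn.Parameter(torch.zeros(num_subspaces))
        # Factor 3: feature-level internal attention, a per-dimension gate.
        self.feature_attn = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_modalities, num_subspaces, dim)
        b, m, s, d = feats.shape
        # Per-sample modality saliency, shared across subspaces of that modality.
        modality_scores = self.saliency(feats.mean(dim=2)).squeeze(-1)   # (b, m)
        modality_w = F.softmax(modality_scores, dim=-1).unsqueeze(-1)    # (b, m, 1)
        # Global importance of each subspace type.
        subspace_w = F.softmax(self.subspace_importance, dim=-1)         # (s,)
        subspace_w = subspace_w.view(1, 1, s)                            # (1, 1, s)
        # Combine the two scalar factors into one weight per (modality, subspace) slot.
        fusion_w = modality_w * subspace_w                               # (b, m, s)
        fusion_w = fusion_w / fusion_w.sum(dim=(1, 2), keepdim=True)     # renormalize
        # Feature-level gate applied inside each representation.
        gated = feats * torch.sigmoid(self.feature_attn(feats))          # (b, m, s, d)
        # Weighted sum over modalities and subspaces -> one fused vector per sample.
        return (fusion_w.unsqueeze(-1) * gated).sum(dim=(1, 2))          # (b, d)
```

In this sketch the three factors are computed independently (in parallel) and only interact at the final weighting step, mirroring the "multiple orthogonal factors" wording; the fused vector would then feed a downstream sentiment regression or classification head.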