Multi-Metric Representation Learning Strategy Based on Clustering for Fine-Grained Multimodal Sentiment Analysis
Yidan Wang ⋅ Zongheng Wang ⋅ Hongjie Xing ⋅ Chunguo Li ⋅ Xiaoxiao Liu
Abstract
Multimodal sentiment analysis (MSA) aims to identify human emotions through multimodal data. Despite considerable advances in MSA, we find that emotional class centers often overlap when data from different modalities are integrated into the same representation space. In this paper, we propose a novel $\textbf{M}$ulti-$\textbf{M}$etric $\textbf{R}$epresentation l$\textbf{e}$arning $\textbf{s}$trategy based on clus$\textbf{t}$ering (MMRest) that alleviates this issue through flexible multi-metric representation learning, enabling the model to learn fine-grained sentiments. Specifically, we first design a module termed Multi-metric Multimodal learning on Clusters (MMC), which minimizes distances between similar sentiment pairs while maximizing distances between dissimilar ones, learning a global metric and a local metric for each cluster from multimodal data. We then develop a Projection and Decision-Level Fusion (PDLF) module comprising two parts: one uses the learned global and local metrics to compute a projection value; the other combines this projection value with an intermediate score, obtained by fusing unimodal and multimodal representations, to produce the final sentiment prediction. Extensive experiments on the CMU-MOSI and CMU-MOSEI datasets demonstrate that our method significantly outperforms state-of-the-art methods across multiple evaluation metrics while using fewer parameters, by effectively learning fine-grained emotional boundaries. The code will be made open-source if the paper is accepted.
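The paper's implementation is not given here; as a rough illustration of the multi-metric idea described above, the following PyTorch sketch (all class and function names hypothetical) pairs a learnable global Mahalanobis-style metric with one local metric per cluster, and trains them with a hinge-style pair loss that pulls same-sentiment pairs together and pushes different-sentiment pairs apart. It is a minimal sketch under these assumptions, not the authors' MMC module.

```python
import torch
import torch.nn as nn

class MultiMetricLearner(nn.Module):
    """Sketch: one global metric plus one local metric per cluster.

    Each metric is parameterized by a factor L so that the induced
    squared distance d(x, y) = ||L (x - y)||^2 stays non-negative.
    """

    def __init__(self, dim: int, num_clusters: int):
        super().__init__()
        # Factor for the global metric, initialized to the identity (Euclidean).
        self.global_L = nn.Parameter(torch.eye(dim))
        # One factor per cluster for the local metrics: (num_clusters, dim, dim).
        self.local_L = nn.Parameter(torch.eye(dim).repeat(num_clusters, 1, 1))

    def distance(self, x, y, cluster_id):
        diff = x - y                               # (B, dim)
        g = diff @ self.global_L.T                 # global projection, (B, dim)
        Lc = self.local_L[cluster_id]              # per-sample local factor, (B, dim, dim)
        l = torch.einsum('bij,bj->bi', Lc, diff)   # local projection, (B, dim)
        # Combine global and local squared distances.
        return (g * g).sum(-1) + (l * l).sum(-1)

def pair_loss(model, anchors, others, cluster_ids, same_label, margin=1.0):
    """Pull similar sentiment pairs together; push dissimilar ones apart.

    `same_label` is a float tensor of 1s (similar pair) and 0s (dissimilar).
    """
    d = model.distance(anchors, others, cluster_ids)
    pos = same_label * d                                        # minimize similar-pair distance
    neg = (1 - same_label) * torch.clamp(margin - d, min=0.0)   # hinge on dissimilar pairs
    return (pos + neg).mean()
```

In this reading, the global factor captures structure shared across all sentiment clusters while each local factor adapts the geometry inside its cluster, which is one standard way to realize the global-plus-local metric combination the abstract describes.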