Prototype-as-Prompt: Multimodal Sentiment Prototypes Endowing Large Language Models with the Capability to Perform Multimodal Sentiment Analysis
Abstract
Multimodal Sentiment Analysis (MSA) aims to integrate textual, acoustic, and visual information to predict sentiment polarity. With the emergence of Large Language Models (LLMs), existing studies commonly employ learnable queries to compress audio–visual representations and feed them as soft prompts into LLMs for MSA. However, because such queries are learned implicitly, they lack explicit guidance on how each query encodes sentiment semantics. To address this issue, we propose a Prototype-as-Prompt (PaP) framework that maps audio–visual representations onto a fixed set of multimodal sentiment prototypes, which are then used as soft prompts to guide the LLM in performing MSA. Concretely, we first compress both textual and non-textual features into multimodal prototypes using a resampling-based strategy. We then introduce a sentiment-aware prototype learning scheme that explicitly binds the multimodal prototypes to sentiment semantics. To ensure both cross-modal consistency and intra-modal diversity of the prototypes, we design a cross-modal prototype alignment constraint and a distance-weighted prototype diversity constraint. Extensive experiments across three LLMs and four benchmark datasets show that PaP achieves superior performance while training only 0.09\%–0.26\% of the parameters, highlighting its effectiveness and parameter efficiency.