Paper in Workshop: Exploring the Next Generation of Data
Why We Feel: Breaking Boundaries in Emotional Reasoning with Multimodal Large Language Models
Yuxiang Lin · Jingdong Sun · Zhi-Qi Cheng · Jue Wang · Haomin Liang · Zebang Cheng · Yifei Dong · Jun-Yan He · Xiaojiang Peng · Xian-Sheng Hua
Most existing emotion analysis emphasizes \emph{which} emotion arises (e.g., happy, sad, angry) but neglects the deeper \emph{why}. We propose \textbf{Emotion Interpretation (EI)}, focusing on \emph{causal factors}—whether explicit (e.g., observable objects, interpersonal interactions) or implicit (e.g., cultural context, off-screen events)—that drive emotional responses. Unlike traditional emotion recognition, EI requires \emph{reasoning about triggers} rather than mere labeling. To facilitate EI research, we present \textbf{EIBench}, a large-scale benchmark encompassing \num{1615} \emph{basic} EI samples and \num{50} \emph{complex} EI samples featuring multifaceted emotions. Each instance demands a rationale-based explanation rather than a straightforward categorization. We further propose a \emph{Coarse-to-Fine Self-Ask (CFSA)} annotation pipeline, which guides Vision Large Language Models (VLLMs) through iterative question-answer rounds to yield high-quality labels at scale. Extensive evaluations of open-source and proprietary large language models under four experimental settings reveal consistent performance gaps—especially in more intricate scenarios—underscoring EI's potential to enrich empathetic, context-aware AI applications. Our benchmark and methods are publicly available at \href{https://github.com/Lum1104/EIBench}{https://github.com/Lum1104/EIBench}, offering a foundation for advanced multimodal causal analysis and next-generation affective computing.
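The abstract describes CFSA only at a high level, as iterative question-answer rounds with a VLLM. The sketch below is purely illustrative of that self-ask idea, not the authors' implementation: it assumes a generic query_vlm(image_path, prompt) callable standing in for whatever vision-language model API is used, and the function name, prompt wording, and round count are hypothetical.

```python
from typing import Callable, Dict, List


def coarse_to_fine_self_ask(
    image_path: str,
    emotion_label: str,
    query_vlm: Callable[[str, str], str],  # placeholder for any image+text -> text VLM API
    num_rounds: int = 3,
) -> Dict:
    """Illustrative coarse-to-fine self-ask loop for rationale-style EI annotation."""
    # Coarse round: build an overall description of the scene.
    context = query_vlm(image_path, "Describe the scene, the people, and their interactions.")

    # Fine rounds: self-ask follow-up questions that narrow down explicit triggers
    # (objects, interactions) and implicit ones (cultural context, off-screen events).
    qa_trace: List[Dict[str, str]] = []
    for _ in range(num_rounds):
        question = query_vlm(
            image_path,
            f"Scene so far: {context}\n"
            f"Ask one follow-up question that would help explain why the person feels {emotion_label}.",
        )
        answer = query_vlm(image_path, question)
        qa_trace.append({"question": question, "answer": answer})
        context += f"\nQ: {question}\nA: {answer}"

    # Final step: distill the question-answer trace into a rationale-style annotation.
    rationale = query_vlm(
        image_path,
        f"Based on this dialogue:\n{context}\n"
        f"Explain in two or three sentences the causal factors behind the {emotion_label} emotion.",
    )
    return {"emotion": emotion_label, "qa_trace": qa_trace, "rationale": rationale}
```

Any multimodal chat API that maps an image and a text prompt to a text answer could be dropped in as query_vlm; the details of how the released pipeline does this are in the linked repository.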