EmoThinker: Advancing Visual-Acoustic Emotion Analysis via Structural Token Selection and Chain-of-Thought Reasoning
Abstract
Multimodal Emotion Analysis (MEA) is crucial for human-centric AI, yet current methods struggle with two core challenges: the sparsity of emotional cues across modalities and their inherent temporal asynchrony. Existing approaches, which often rely on implicit fusion, consequently suffer from diluted salient features and entangled representations. To address these challenges, we propose EmoThinker, a new framework that advances MEA through explicit, structured reasoning. Our method introduces a structural token selection mechanism that focuses on pivotal facial regions while condensing background context, improving both visual saliency and efficiency. For audio, an audio evidence extractor aggregates critical paralinguistic features into compact, emotion-rich tokens. More importantly, we enable step-by-step reasoning by constructing a Chain-of-Emotion-Thought dataset, which provides fine-grained annotations for disentangling asynchronous cues and resolving inter-modal conflicts. By decoupling evidence acquisition from reasoning, EmoThinker delivers more interpretable and robust emotion analysis. Extensive experiments on multiple benchmarks demonstrate that our framework achieves new state-of-the-art performance.
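To make the structural token selection idea concrete, the sketch below shows one plausible realization: keep the top-k visual tokens ranked by a saliency score and condense the remaining background tokens into a single context token. All names, shapes, and the saliency-weighted pooling here are illustrative assumptions for exposition, not EmoThinker's actual implementation.

```python
# Minimal sketch of saliency-based structural token selection (assumed design,
# not the paper's exact method): keep top-k salient visual tokens and pool the
# rest into one background context token.
import torch

def select_tokens(tokens: torch.Tensor, saliency: torch.Tensor, k: int) -> torch.Tensor:
    """tokens: (B, N, D) visual tokens; saliency: (B, N) per-token scores."""
    # Indices of the k most salient tokens per sample.
    topk = saliency.topk(k, dim=1).indices                         # (B, k)
    idx = topk.unsqueeze(-1).expand(-1, -1, tokens.size(-1))       # (B, k, D)
    salient = tokens.gather(1, idx)                                # (B, k, D)

    # Condense the non-selected (background) tokens into a single context
    # token via a saliency-weighted average -- an illustrative choice.
    bg_mask = torch.ones_like(saliency, dtype=torch.bool).scatter(1, topk, False)
    bg_weights = saliency.masked_fill(~bg_mask, float("-inf")).softmax(dim=1)  # (B, N)
    background = torch.einsum("bn,bnd->bd", bg_weights, tokens).unsqueeze(1)   # (B, 1, D)

    # Output: k salient tokens plus one condensed background token.
    return torch.cat([salient, background], dim=1)                 # (B, k+1, D)
```

Under this reading, the efficiency gain in the abstract comes from shrinking the visual token sequence from N to k+1 before fusion, while the retained tokens concentrate on the most emotion-relevant facial regions.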