Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition
Abstract
Multimodal intent recognition aims to infer human intents by jointly modeling multiple modalities and plays a pivotal role in real-world dialogue systems. However, current methods struggle to model the hierarchical semantics underlying complex intents and lack the capacity for self-evolving reasoning over multimodal representations. To address these issues, we propose HIER, a novel method that integrates HIerarchical semantic representation with Evolutionary Reasoning based on a Multimodal Large Language Model (MLLM). Inspired by human cognition, HIER introduces a structured reasoning paradigm that organizes multimodal semantics into three progressively more abstract levels. It starts with modality-specific tokens extracted by encoders, which capture localized semantic cues. These tokens are then abstracted into semantic concepts via a label-guided clustering strategy, yielding mid-level intent-aware patterns. To capture higher-order structure, inter-concept relations are selected through a JS-divergence-based mechanism that highlights salient dependencies among concepts. These hierarchical representations are then injected into the MLLM via chain-of-thought (CoT) prompting, enabling step-wise reasoning. In addition, HIER employs a self-evolution mechanism that refines semantic representations through MLLM feedback, allowing dynamic adaptation during inference. Experiments on three challenging benchmarks show that HIER not only consistently outperforms state-of-the-art methods and MLLMs by 1–3% across all metrics, but also exhibits strong generalization across diverse backbones.
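To make the JS-divergence-based relation selection concrete, the sketch below shows one plausible instantiation; it is not the authors' implementation. It assumes each semantic concept is summarized by a discrete distribution over intent labels, and that the most divergent concept pairs are treated as the salient relations (whether high or low divergence marks salience, and the top-k cutoff, are illustrative assumptions).

```python
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Jensen-Shannon divergence between two discrete distributions."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    m = 0.5 * (p + q)

    def _kl(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.sum(a * np.log((a + eps) / (b + eps))))

    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def select_salient_relations(concept_dists: np.ndarray, top_k: int = 5):
    """Score every concept pair by JS divergence and keep the top-k pairs,
    i.e., the inter-concept relations retained for the hierarchical prompt."""
    n = concept_dists.shape[0]
    scored = [
        (i, j, js_divergence(concept_dists[i], concept_dists[j]))
        for i in range(n) for j in range(i + 1, n)
    ]
    scored.sort(key=lambda t: t[2], reverse=True)
    return scored[:top_k]

# Usage with random stand-in data: 8 concepts described over 20 intent labels.
rng = np.random.default_rng(0)
dists = rng.random((8, 20))
print(select_salient_relations(dists, top_k=3))
```

The selected (i, j, score) triples would then be verbalized and inserted into the CoT prompt alongside the token- and concept-level descriptions; the exact prompt format is specified in the method section rather than here.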