Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events
Abstract
Multimodal Summarization (MMS) aims to generate concise textual summaries by understanding and integrating key information across multiple modalities such as videos, transcripts, and images. However, existing approaches still suffer from three main challenges: (1) reliance on domain-specific supervision, (2) implicit fusion with weak cross-modal grounding, and (3) flat temporal modeling without event transitions. To address these issues, we introduce CoE, a training-free MMS framework that performs structured reasoning through a Chain-of-Events guided by a Hierarchical Event Graph (HEG). The HEG encodes textual semantics into a structured prior that serves as a global scaffold for cross-modal reasoning. Guided by this hierarchy, CoE first performs cross-modal grounding to localize key visual cues, followed by event-evolution reasoning to capture temporal dependencies and causal transitions across the video. A lightweight style adaptation module further refines the generated summaries to match domain-specific linguistic conventions. Extensive experiments on eight diverse datasets demonstrate that CoE consistently outperforms state-of-the-art video CoT baselines, achieving average gains of +3.04 ROUGE, +9.51 CIDEr, and +1.88 BERTScore, highlighting its robustness, interpretability, and superior cross-domain generalization.