Omni-MMSI: Toward Identity-attributed Social Interaction Understanding
Abstract
We introduce \textbf{Omni-MMSI}, a new task that requires comprehensive social interaction understanding from raw audio, vision, and speech. The task involves two tightly coupled goals: extracting identity-attributed social cues (e.g., who is speaking what) and reasoning about the social interaction (e.g., whom the speaker refers to). This task is essential for developing AI assistants that can perceive and respond to human interactions. Unlike prior studies that assume identity-attributed social cues are perfectly provided, Omni-MMSI reflects realistic scenarios where AI assistants must perceive raw multi-modal streams and reason over the extracted social cues. However, existing pipelines and multi-modal LLMs perform poorly in this setting because they lack reliable identity attribution, which leads to inaccurate social cues and weak interaction reasoning. To address this challenge, we propose \textbf{Omni-MMSI-R}, a reference-based pipeline that uses reference audio-vision pairs to produce identity-attributed social cues and leverages curated chain-of-thought supervision for reasoning over reference-based inputs. To enable reference-based research, we construct participant-level reference pairs and curated reasoning annotations on top of existing datasets. Extensive experiments demonstrate that Omni-MMSI-R consistently outperforms advanced multi-modal LLMs and counterpart pipelines on Omni-MMSI.