

Poster

Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations

Sangmin Lee · Bolin Lai · Fiona Ryan · Bikram Boote · James Rehg

Oral presentation:

Abstract:

Understanding social interactions involving both verbal (e.g., language) and non-verbal (e.g., gaze, gesture) cues is crucial for developing social artificial intelligence that can engage alongside humans. However, most prior work on multimodal social behavior focuses predominantly on single-person behaviors or relies on holistic visual representations that are not densely aligned to utterances in multi-party environments. As a result, it is limited in modeling the intricate dynamics of multi-party interactions. In this paper, we introduce three new challenging tasks to model the fine-grained dynamics between multiple people: speaking target identification, pronoun coreference resolution, and mentioned player prediction. We contribute extensive data annotations to curate these new challenges in social deduction game settings. We further propose a novel multimodal baseline that leverages densely aligned language-visual representations by synchronizing visual features with their corresponding utterances. This facilitates concurrently capturing verbal and non-verbal signals pertinent to social reasoning. Experiments demonstrate the effectiveness of the proposed approach with densely aligned multimodal representations for modeling social interactions. We will release our benchmarks and source code to facilitate further research.
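To illustrate what "densely aligned language-visual representations" could look like in practice, the sketch below pools per-frame visual features over each utterance's time span and concatenates the result with that utterance's text embedding. This is only a minimal illustration, not the authors' released implementation; all names (frame_feats, text_emb, etc.) and the mean-pooling fusion are assumptions made for the example.

```python
# Minimal sketch (assumed, not the paper's code): align per-frame visual
# features with utterances by pooling frames inside each utterance's time
# span, then fuse with the utterance's text embedding.
import numpy as np

def align_visual_to_utterances(frame_feats, frame_times, utterances):
    """frame_feats: (T, Dv) per-frame visual features (e.g., gaze/gesture cues).
    frame_times: (T,) frame timestamps in seconds.
    utterances: list of dicts with 'start', 'end', and 'text_emb' (Dt,).
    Returns one fused (Dv + Dt,) vector per utterance."""
    fused = []
    for utt in utterances:
        # Select frames that fall within this utterance's time span.
        mask = (frame_times >= utt["start"]) & (frame_times <= utt["end"])
        if mask.any():
            visual = frame_feats[mask].mean(axis=0)   # pool frames in the span
        else:
            visual = np.zeros(frame_feats.shape[1])   # no overlapping frames
        fused.append(np.concatenate([visual, utt["text_emb"]]))
    return fused

# Toy usage: 10 frames of 4-D visual features at 1 fps, two utterances
# with random 3-D text embeddings (all values synthetic).
rng = np.random.default_rng(0)
feats = rng.normal(size=(10, 4))
times = np.arange(10, dtype=float)
utts = [
    {"start": 0.0, "end": 3.0, "text_emb": rng.normal(size=3)},
    {"start": 4.0, "end": 7.0, "text_emb": rng.normal(size=3)},
]
print([v.shape for v in align_visual_to_utterances(feats, times, utts)])
```

The key design point the example mirrors is that fusion happens at the utterance level, so each utterance is paired with exactly the visual evidence visible while it was spoken, rather than with a single holistic video representation.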
