Seeing Conversations: Communication Context Identification in Egocentric Video
Abstract
In everyday conversations, humans effortlessly recognize their communication partners using visual cues such as gaze and head orientation. Replicating this social reasoning in computer vision is challenging, especially in dynamic, multi-person settings. We introduce Communication Context Identification (CCI) in egocentric vision: given a first-person video sequence, determine which individuals are engaged in communication with the camera wearer. To support CCI, we collected a challenging large-scale dataset comprising 68.9 hours of egocentric video captured across diverse multi-person, multi-conversation scenarios. We propose CoCoNet, a temporal interaction model for CCI that tracks social dynamics via attention across individuals over long time scales. CoCoNet flexibly handles varying group sizes, maintains predictions through occlusions, and performs robustly even with limited temporal input. Leveraging long temporal context, it achieves 96% balanced accuracy on CCI. Performance varies with group size and spatial scene layout, highlighting the importance of dataset diversity. Our work advances vision-based conversational awareness, enabling assistive-hearing applications that use egocentric video to enhance the speech of individuals in the user's conversation group.
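To make the abstract's description of CoCoNet concrete, the sketch below shows one plausible shape for a model that attends over time within each person's track and across individuals in the scene, handles variable group sizes, and tolerates occluded frames via masking. This is a minimal illustration under our own assumptions, not the paper's actual architecture; the class name, hyperparameters, and input representation (per-person feature tracks) are all hypothetical.

```python
# Hypothetical sketch of a CoCoNet-style temporal interaction classifier.
# Assumes each candidate person is represented by a per-frame feature track
# (e.g., pooled face/body embeddings); names and sizes are illustrative only.
import torch
import torch.nn as nn


class TemporalInteractionModel(nn.Module):
    def __init__(self, feat_dim: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True
        )
        # Self-attention over time within each person's track.
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Self-attention across people, so the model can reason about
        # group dynamics with a variable number of individuals.
        self.social = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Per-person logit: "is this person communicating with the wearer?"
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, tracks: torch.Tensor, visible: torch.Tensor) -> torch.Tensor:
        # tracks:  (batch, persons, time, feat_dim) per-person feature tracks
        # visible: (batch, persons, time) bool mask; False = occluded frame
        b, p, t, d = tracks.shape
        x = self.temporal(
            tracks.reshape(b * p, t, d),
            src_key_padding_mask=~visible.reshape(b * p, t),
        ).reshape(b, p, t, d)
        # Pool each track over its visible frames, then attend across people.
        w = visible.unsqueeze(-1).float()
        pooled = (x * w).sum(dim=2) / w.sum(dim=2).clamp(min=1.0)  # (b, p, d)
        pooled = self.social(pooled)
        return self.head(pooled).squeeze(-1)  # (batch, persons) logits


model = TemporalInteractionModel()
tracks = torch.randn(2, 5, 64, 256)   # 2 clips, 5 people, 64 frames each
visible = torch.rand(2, 5, 64) > 0.2  # simulated occlusions
logits = model(tracks, visible)       # (2, 5) per-person predictions
```

Masking occluded frames out of the temporal attention, rather than dropping them, is one way a model of this kind could maintain predictions through occlusions, as the abstract claims CoCoNet does.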