

Poster

Sound Bridge: Associating Egocentric and Exocentric Videos via Audio Cues

Sihong Huang · Jiaxin Wu · Xiaoyong Wei · Yi Cai · Dongmei Jiang · Yaowei Wang


Abstract:

Understanding human behavior and environmental information in egocentric video is challenging because some actions (e.g., laughing and sneezing) are invisible from the first-person view, and that view captures only a local portion of the scene. Leveraging the corresponding exocentric video to provide global context has shown promising results. However, existing visual-to-visual and visual-to-textual Ego-Exo video alignment methods struggle when the two views of the same activity share little or no visual overlap. To address this, we propose using sound as a bridge, since audio is often consistent across Ego-Exo videos. Direct audio-to-audio alignment, however, lacks context. We therefore introduce two context-aware sound modules: one aligns audio with vision via a visual-audio cross-attention module, and the other aligns text with sound through closed captions generated by an LLM. Experimental results on two Ego-Exo video association benchmarks show that each of the two proposed modules improves over state-of-the-art methods. Moreover, the proposed sound-aware egocentric and exocentric representations boost performance on downstream tasks such as action recognition in exocentric videos and scene recognition in egocentric videos. The code and models can be accessed at https://github.com/openuponacceptance.
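
For illustration only, below is a minimal PyTorch sketch of what a visual-audio cross-attention module of this kind could look like: audio tokens act as queries and attend to visual tokens from the same clip, yielding a visually contextualized audio representation. The class name, token counts, and embedding dimension are assumptions for the example and do not reflect the authors' released implementation.

```python
import torch
import torch.nn as nn

class VisualAudioCrossAttention(nn.Module):
    """Hypothetical sketch: audio tokens attend to visual tokens so the
    resulting audio embedding carries visual context from the same clip."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # Query: audio tokens; Key/Value: visual tokens.
        ctx, _ = self.attn(query=audio_tokens, key=visual_tokens, value=visual_tokens)
        # Residual connection preserves the original audio signal.
        return self.norm(audio_tokens + ctx)

# Toy usage: batch of 2 clips, 32 audio tokens and 64 visual tokens, 512-d embeddings.
audio = torch.randn(2, 32, 512)
visual = torch.randn(2, 64, 512)
fused = VisualAudioCrossAttention()(audio, visual)
print(fused.shape)  # torch.Size([2, 32, 512])
```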
