

Retrieval-Augmented Egocentric Video Captioning

Jilan Xu · Yifei Huang · Junlin Hou · Guo Chen · Yuejie Zhang · Rui Feng · Weidi Xie

Arch 4A-E Poster #374
Thu 20 Jun 10:30 a.m. PDT — noon PDT


Understanding human actions from first-person videos poses significant challenges. Most prior approaches explore representation learning on egocentric videos only, overlooking the potential benefit of exploiting existing large-scale third-person videos. In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the captioning of egocentric videos. (2) To train the cross-view retrieval module, we devise an automatic pipeline to discover ego-exo video pairs from distinct large-scale egocentric and exocentric datasets. (3) We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions. (4) In extensive experiments, our cross-view retrieval module demonstrates superior performance across seven benchmarks. For egocentric video captioning, EgoInstructor exhibits significant improvements by leveraging third-person videos as references.
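
The abstract does not give the form of the EgoExoNCE loss, so the following is only a rough sketch of the general idea of aligning egocentric and exocentric video features to shared text features. It assumes a symmetric, multi-positive InfoNCE-style objective over a mixed batch of ego and exo clips, where a video and a text count as a positive pair whenever they carry the same action label. The function name `multi_positive_nce`, the `action_ids` input, and the temperature value are illustrative assumptions, not the paper's definitions.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def multi_positive_nce(video_feats, text_feats, action_ids, temperature=0.07):
    """Hypothetical multi-positive contrastive loss (an assumption, not the paper's).

    video_feats: (N, D) features of a batch mixing ego and exo clips.
    text_feats:  (N, D) features of the narration paired with each clip.
    action_ids:  (N,) integer labels; equal labels mean "similar action".
    """
    v = l2_normalize(np.asarray(video_feats, dtype=float))
    t = l2_normalize(np.asarray(text_feats, dtype=float))
    ids = np.asarray(action_ids)

    logits = v @ t.T / temperature  # (N, N) video-to-text similarities
    # pos[i, j] = 1 when text j describes the same action as video i.
    # The diagonal is always 1, so every row and column has a positive.
    pos = (ids[:, None] == ids[None, :]).astype(float)

    # Average the log-likelihood over all positives, in both directions.
    v2t = -(pos * log_softmax(logits, axis=1)).sum(axis=1) / pos.sum(axis=1)
    t2v = -(pos * log_softmax(logits, axis=0)).sum(axis=0) / pos.sum(axis=0)
    return 0.5 * (v2t.mean() + t2v.mean())


# Toy batch: clips 0 and 1 (say, one ego and one exo) share action 0,
# clips 2 and 3 share action 1.
texts = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
ids = np.array([0, 0, 1, 1])
aligned_loss = multi_positive_nce(texts.copy(), texts, ids)
misaligned_loss = multi_positive_nce(texts[::-1].copy(), texts, ids)
print(aligned_loss < misaligned_loss)
```

Because both an ego clip and an exo clip are pulled toward the same text features whenever they share an action label, the two views end up close to each other in the embedding space without being contrasted directly, which is the effect the abstract describes.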
