SnAG: Scalable and Accurate Video Grounding

Fangzhou Mu · Sicheng Mo · Yin Li

Arch 4A-E Poster #418
Thu 20 Jun 5 p.m. PDT — 6:30 p.m. PDT


Temporal grounding of text descriptions in video is an important task in vision-language learning and remains a challenging problem in video understanding. Existing methods focus on grounding a few text queries within minute-long videos, yet fail to scale up to hour-long videos with hundreds of queries. In this paper, we present a systematic study of the design of scalable video grounding models. We compare design choices for cross-modal fusion, analyze their computational cost, and distill key insights and a new training scheme that enable scalable video grounding. We further present a simple model built on these findings. Our model attains superior accuracy and efficiency on recent benchmarks for long-form video grounding, while remaining highly competitive on previous benchmarks comprising short videos.
