InstAP: Instance-Aware Vision-Language Pre-Training for Spatial-Temporal Understanding
Ashutosh Kumar ⋅ Rajat Saini ⋅ Jingjing Pan ⋅ Mustafa Erdogan ⋅ Mingfang Zhang ⋅ Betty Le ⋅ Norimasa Kobori ⋅ Quan Kong
Abstract
Current vision–language pre-training (VLP) paradigms excel at global scene understanding but struggle with instance-level reasoning due to global-only supervision. We introduce $\textbf{InstAP}$, an $\textbf{Inst}$ance-$\textbf{A}$ware $\textbf{P}$re-training framework that jointly optimizes global image–text alignment and fine-grained, instance-level contrastive alignment by grounding textual mentions to specific spatial–temporal regions. To support this, we present $\textbf{InstVL}$, a large-scale dataset ($2$ million images, $50,000$ videos) with dual-granularity annotations: holistic scene captions and dense, grounded instance descriptions. On the InstVL benchmark, InstAP substantially outperforms existing VLP models on instance-level retrieval, and surpasses a strong VLP baseline trained on the same corpus, isolating the benefit of our instance-aware objective. Moreover, instance-centric pre-training improves global understanding: InstAP achieves competitive zero-shot performance on multiple video benchmarks, including MSR-VTT and DiDeMo. Qualitative visualizations further show that InstAP localizes textual mentions to the correct instances, whereas global-only models exhibit more diffuse, scene-level attention.
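The abstract does not give the exact form of the joint objective. A minimal sketch, assuming both alignment terms are symmetric InfoNCE-style contrastive losses combined by a trade-off weight (the symbols $r_i$, $p_i$, $\tau$, and $\lambda$ below are illustrative assumptions, not taken from the paper):

$$
\mathcal{L} = \mathcal{L}_{\text{global}} + \lambda\,\mathcal{L}_{\text{inst}}, \qquad
\mathcal{L}_{\text{inst}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\!\big(\mathrm{sim}(r_i, p_i)/\tau\big)}{\sum_{j=1}^{N}\exp\!\big(\mathrm{sim}(r_i, p_j)/\tau\big)},
$$

where $r_i$ is the embedding of a spatial–temporal region, $p_i$ the embedding of its grounded textual mention, $\mathrm{sim}(\cdot,\cdot)$ cosine similarity, and $\tau$ a temperature; $\mathcal{L}_{\text{global}}$ would be the analogous loss over whole-image and scene-caption embeddings.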