Poster
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding
Shuming Liu · Chen Zhao · Tianqi Xu · Bernard Ghanem
Large vision-language models (VLMs) have shown promising progress on various video understanding tasks. However, their potential for long-form video analysis is limited by high computational resource requirements and constrained context windows. Traditional approaches, particularly uniform frame sampling, often allocate resources to irrelevant content, reducing effectiveness in real-world scenarios. This paper introduces BOLT, a method to BOost Large VLMs without additional Training, through an extensive study of frame selection strategies. To provide a realistic evaluation of VLMs in long-form video understanding, we first present a multi-source retrieval evaluation setting. Our findings show that uniform sampling significantly underperforms in noisy contexts, highlighting the importance of selecting the right frames. Furthermore, we introduce several frame selection strategies based on query-frame similarity and analyze their effectiveness in enhancing VLM performance without retraining. We find that inverse transform sampling with refined query descriptions yields the most substantial improvement, boosting accuracy on the Video-MME benchmark from 49.94% to 53.8%. Our code will be released.
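A minimal sketch of what similarity-driven inverse transform sampling could look like, based only on the abstract's description. This is not the authors' released implementation: the softmax normalization, the temperature value, the use of evenly spaced quantiles, and the function name are all assumptions for illustration.

```python
# Hedged sketch: pick frames by inverse transform sampling over a
# query-frame similarity distribution (e.g., cosine similarity between a
# text embedding of the refined query and per-frame image embeddings).
import numpy as np

def select_frames_by_similarity(similarities: np.ndarray,
                                num_frames: int,
                                temperature: float = 0.1) -> np.ndarray:
    """Return up to `num_frames` frame indices, biased toward high-similarity
    regions while retaining temporal coverage. `similarities` is a (T,) array
    of query-frame similarity scores. Normalization details are assumptions."""
    # Turn similarity scores into a probability distribution over frames.
    logits = similarities / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Cumulative distribution function along the temporal axis.
    cdf = np.cumsum(probs)

    # Inverse transform sampling at evenly spaced quantiles: regions with
    # higher similarity mass receive proportionally more selected frames.
    quantiles = (np.arange(num_frames) + 0.5) / num_frames
    indices = np.searchsorted(cdf, quantiles)
    # Duplicates (when many quantiles fall in one high-mass frame) are
    # collapsed here for simplicity.
    return np.unique(np.clip(indices, 0, len(similarities) - 1))

# Toy usage: a 1000-frame video where frames 400-500 are most query-relevant.
sims = np.random.rand(1000) * 0.1
sims[400:500] += 0.5
print(select_frames_by_similarity(sims, num_frames=32))
```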