VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking
Abstract
Video agentic models have substantially advanced video-language understanding performance. However, most agentic approaches rely heavily on greedy parsing over densely sampled video frames, resulting in high computational cost. Instead, we argue that leveraging the logical flow of videos allows models to use far fewer frames while maintaining, or even improving, their video understanding capability. In this paper, we introduce VideoSeek, a long-horizon video agent that actively seeks informative content via tool use, conditioned on the underlying logic flows throughout videos. Specifically, the VideoSeek agent follows a think–act–observe loop: it reasons over collected evidence to determine a tool-use plan, acts by calling tools to gather new observations, and stops once the evidence is sufficient to answer the given question. Experiments on four long-form video understanding and complex reasoning benchmarks demonstrate the superiority of VideoSeek. Notably, VideoSeek achieves an absolute improvement of 10.2 points on LVBench over its base model, GPT-5, while using 93% fewer frames. Further, a comprehensive analysis highlights the significance of leveraging logic flow, strong reasoning capability, and toolkit design for video agents.
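The think–act–observe loop described above can be sketched in a minimal form. The class name, tool interface, and stopping rule below are illustrative assumptions for exposition, not the actual VideoSeek implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical sketch of a think–act–observe agent loop.
# Tool names, the planning heuristic, and the stopping condition are
# assumptions made for illustration only.

@dataclass
class SeekAgent:
    tools: Dict[str, Callable[[str], str]]           # tool name -> callable tool
    max_steps: int = 8                               # budget on tool calls
    evidence: List[str] = field(default_factory=list)

    def think(self, question: str) -> Optional[Tuple[str, str]]:
        """Reason over collected evidence; return the next tool call,
        or None once the evidence is judged sufficient."""
        if any("answer:" in e for e in self.evidence):
            return None                              # stop: evidence suffices
        step = len(self.evidence)
        names = list(self.tools)
        if step >= len(names) or step >= self.max_steps:
            return None                              # stop: budget exhausted
        return names[step], question                 # naive plan: try tools in order

    def run(self, question: str) -> List[str]:
        while (plan := self.think(question)) is not None:
            name, args = plan                        # act: call the chosen tool
            observation = self.tools[name](args)     # observe: gather new evidence
            self.evidence.append(observation)
        return self.evidence
```

In this toy loop, a caption tool that emits an "answer:" string halts the agent early, standing in for the sufficiency check that lets the real agent avoid dense frame sampling.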