LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
Abstract
Large multimodal models (LMMs) have shown great potential in video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucination, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos—by first skimming globally and then examining relevant clips for details—we introduce LongVT, an end-to-end agentic framework that sparks "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs’ inherent temporal grounding ability as a native video cropping tool to zoom in on specific video clips and resample finer-grained frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering data for long-video reasoning, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.7K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark contains 1,280 QA pairs carefully verified through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms strong existing baselines across four challenging long-video understanding and reasoning benchmarks.
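The global-to-local loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function and tool names (`ask_model`, `sample_frames`, the `"crop"` tool call) are hypothetical stand-ins, and the real system drives this loop through the LMM's native tool calling rather than a toy policy.

```python
def sample_frames(start: float, end: float, n: int) -> list[float]:
    """Uniformly sample n frame timestamps (in seconds) from [start, end]."""
    step = (end - start) / max(n - 1, 1)
    return [round(start + i * step, 3) for i in range(n)]

def answer_with_long_video(ask_model, question: str, duration: float,
                           n_frames: int = 8, max_rounds: int = 4):
    """Global-to-local loop: skim the whole video with coarse frames, then
    repeatedly let the model invoke a 'crop' tool to resample finer-grained
    frames from a sub-clip, until it commits to a grounded answer or the
    round budget is exhausted."""
    start, end = 0.0, duration  # round 1: global skim over the full video
    for _ in range(max_rounds):
        frames = sample_frames(start, end, n_frames)
        result = ask_model(question, frames)
        if result[0] == "crop":        # model requests finer visual evidence
            _, start, end = result     # zoom into the proposed clip
        else:
            return result[1]           # answer grounded in retrieved frames
    return None                        # budget exhausted without an answer

# Toy stand-in for the LMM: zoom once, then answer from the finer frames.
def toy_model(question, frames):
    if frames[-1] - frames[0] > 60:    # frames still span a coarse window
        return ("crop", 120.0, 150.0)  # request a 30-second sub-clip
    return ("answer", f"evidence at {frames[0]}s")

print(answer_with_long_video(toy_model, "What is on the desk?", 3600.0))
```

In the toy run, the first pass samples 8 frames across the full hour, the model zooms into a 30-second clip, and the second pass answers from the densely resampled frames—mirroring the skim-then-examine behavior the abstract describes.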