Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models
Abstract
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in Chain-of-Thought (CoT) reasoning. However, existing LVLM reasoning paradigms begin reasoning only after the entire video becomes available, introducing unnecessary latency and diminishing attention to early visual cues in dynamic scenes. Inspired by the human ability to think while watching, we introduce a streaming reasoning paradigm for LVLMs, where reasoning unfolds sequentially with incoming frames and deepens after the full video has been observed. We instantiate this paradigm through Think-as-You-See (TaYS), a unified framework that enables LVLMs to reason while watching by integrating streaming CoT generation, stream-constrained training, and stream-parallel inference. Specifically, TaYS employs temporally aligned streaming reasoning units with precise CoT supervision, enforces ordered reasoning via streaming attention masks and positional encodings, and uses a parallel KV-cache mechanism that decouples input encoding from reasoning generation, ensuring alignment and true concurrency. We evaluate TaYS on the Qwen2.5-VL model family across representative video CoT tasks, including event dynamics analysis, causal reasoning, and thematic understanding. Experimental results show that TaYS achieves superior reasoning performance compared with batch-mode CoT, while reducing pre-reasoning latency to under one second and overall answer delay by more than 50\%. These findings demonstrate the effectiveness of the streaming paradigm in enabling real-time, human-like reasoning for LVLMs.
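To make the ordering constraint concrete, the sketch below builds a block-wise streaming attention mask of the kind the abstract alludes to: reasoning tokens of step t may attend to frame tokens with timestamps up to t and to earlier reasoning, but never to future frames. This is an illustrative assumption about the mask structure, not the authors' released implementation; the segment layout, the helper name streaming_attention_mask, and the toy token counts are all hypothetical.

```python
# Illustrative sketch only (assumed mask structure, not TaYS's official code).
# Layout of the token sequence: [f_0, r_0, f_1, r_1, ..., f_T, r_T], where
# f_t are frame tokens for step t and r_t is the paired reasoning unit.
import torch

def streaming_attention_mask(frame_lens, reason_lens):
    """Return a boolean mask (True = may attend) enforcing that reasoning at
    step t sees frames up to step t and earlier reasoning, but no future frames."""
    assert len(frame_lens) == len(reason_lens)
    total = sum(frame_lens) + sum(reason_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Record (start, end, is_reasoning, step) for every segment in layout order.
    segments, pos = [], 0
    for t, (f, r) in enumerate(zip(frame_lens, reason_lens)):
        segments.append((pos, pos + f, False, t)); pos += f
        segments.append((pos, pos + r, True, t)); pos += r

    for qs, qe, q_is_r, qt in segments:
        for ks, ke, k_is_r, kt in segments:
            # Allowed: any earlier step; same-step frames; same-step reasoning
            # when the query itself is a reasoning token.
            if kt < qt or (kt == qt and (not k_is_r or q_is_r)):
                mask[qs:qe, ks:ke] = True

    # Intersect with a standard causal mask so generation stays autoregressive.
    causal = torch.tril(torch.ones(total, total, dtype=torch.bool))
    return mask & causal

# Toy example: two steps, 4 frame tokens and 3 reasoning tokens per step.
print(streaming_attention_mask(frame_lens=[4, 4], reason_lens=[3, 3]).int())
```

Under this construction, frame tokens never attend to their own step's reasoning, which mirrors the decoupling of input encoding from reasoning generation described in the abstract.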