Poster Fri, Jun 5, 2026 • 9:45 AM – 11:45 AM PDT ExHall A-F 218

Diffusion Guided Chain-of-Vision for Large Autoregressive Vision Models

Xinyang Wang ⋅ Kecheng Zheng ⋅ Minfeng Zhu ⋅ Wei Wu ⋅ Fan Lu ⋅ Wei Zhai ⋅ Wei Chen

Abstract

Chain-of-Thought (CoT) reasoning has recently shown encouraging progress in vision-language models. However, pure-vision CoT (i.e., chain-of-vision) remains underexplored in visual in-context learning. In this paper, we introduce Diffusion Guided Chain-of-Vision, which integrates an explicit chain-of-thought process into autoregressive vision models through visual priors from pre-trained diffusion models. Concretely, we find that pre-trained diffusion models induce a reliable probability flow in image space, and that intermediate images sampled along this flow exhibit visual coherence and serve as task-free chain-of-vision supervision for pure-vision autoregressive models. Extensive experiments on diverse vision tasks and multi-scale models validate the effectiveness of the proposed method for visual in-context learning. Code and dataset will be made publicly available.