Diffusion Guided Chain-of-Vision for Large Autoregressive Vision Models
Abstract
Chain-of-Thought (CoT) reasoning has recently shown encouraging progress in vision-language models. However, pure-vision CoT (i.e., chain-of-vision) remains underexplored in visual in-context learning. In this paper, we introduce Diffusion Guided Chain-of-Vision, which integrates an explicit chain-of-thought process into autoregressive vision models by using the visual prior of pre-trained diffusion models. Concretely, we find that pre-trained diffusion models induce a reliable probability flow in image space, and that intermediate images sampled along this flow are visually coherent and can serve as task-free chain-of-vision supervision for pure-vision autoregressive models. Extensive experiments on diverse vision tasks and on models at multiple scales validate the effectiveness of the proposed method for visual in-context learning. Code and dataset will be made publicly available.