

Poster

Interleaved-modal Chain-of-Thought

Jun Gao · Yongqi Li · Ziqiang Cao · Wenjie Li


Abstract:

Chain-of-Thought (CoT) prompting elicits large language models (LLMs) to produce a series of intermediate reasoning steps before arriving at the final answer. However, when transitioning to vision-language models (VLMs), text-only rationales struggle to express fine-grained associations with the original image. In this paper, we propose an image-incorporated multimodal Chain-of-Thought, named Interleaved-modal Chain-of-Thought (ICoT), which generates sequential reasoning steps consisting of paired visual and textual rationales to infer the final answer. Intuitively, ICoT requires VLMs to generate fine-grained interleaved-modal content, which current VLMs struggle to fulfill. Considering that the required visual information is usually part of the input image, we propose Attention-driven Selection (ADS) to realize ICoT over existing VLMs. ADS intelligently inserts regions of the input image to generate the interleaved-modal reasoning steps with negligible additional latency. ADS relies solely on the attention maps of VLMs and requires no parameterization, so it is a plug-and-play strategy that generalizes to a spectrum of VLMs. We apply ADS to realize ICoT on two popular VLMs with different architectures. Extensive evaluations on three benchmarks show that ICoT prompting achieves substantial performance (up to 14%) and interpretability improvements compared to existing multimodal CoT prompting methods.
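
To make the ADS idea concrete, the sketch below illustrates one way attention-driven patch selection could work; it is a minimal illustration, not the paper's exact formulation. It assumes a VLM whose image is encoded as patch embeddings and whose attention over those patches is available at decoding time; the function name, tensor shapes, and top_k parameter are hypothetical choices for this example.

import torch

def attention_driven_selection(attn_maps, patch_embeds, top_k=4):
    # attn_maps:    (num_heads, num_text_tokens, num_patches) attention from the
    #               text tokens of the current reasoning step to the image patches,
    #               read off the VLM without any extra trainable parameters.
    # patch_embeds: (num_patches, hidden) visual embeddings of the image patches.
    #
    # Average attention over heads and text positions to obtain one relevance
    # score per image patch.
    patch_scores = attn_maps.mean(dim=(0, 1))           # (num_patches,)
    # Keep the most-attended patches; their embeddings can then be appended to
    # the context as the visual part of an interleaved-modal reasoning step.
    top_idx = torch.topk(patch_scores, k=top_k).indices
    return patch_embeds[top_idx]                         # (top_k, hidden)

Because the selection reuses attention maps the model already computes, a scheme like this adds essentially no extra forward passes, which is consistent with the negligible-latency and plug-and-play claims above.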
