ReaGEN: Adaptive Generation of Structured Chains-of-Thought for Efficient Multimodal Reasoning
Abstract
Large Vision-Language Models (LVLMs) exhibit strong perceptual and linguistic capabilities yet struggle with complex visual reasoning tasks that require structured, compositional, and adaptive inference. Existing approaches rely either on costly inference-time exploration, such as multi-path or tree-based Chain-of-Thought (CoT) search, or on expensive post-training with large curated CoT datasets. We propose ReaGEN, a lightweight framework for the adaptive generation of structured reasoning chains that enhances reasoning without modifying the underlying LVLM. ReaGEN first employs a teacher-guided evolutionary search to collect sample-specific CoT structures, leveraging attention-derived stage importance to capture how information flows across reasoning stages. These adaptive CoT structures are then used to train a compact generator (GEN) that learns to refine and improve CoT structures by reflecting on attention feedback from the reasoning process. At inference, GEN dynamically produces question-adaptive structured CoTs and can be invoked iteratively to refine them based on the LVLM's internal state, achieving the flexibility of deep search with single-path efficiency. Across diverse multimodal reasoning benchmarks, ReaGEN achieves up to +26 accuracy points over test-time scaling methods while reducing average inference-time token usage by 79\%, establishing a scalable and model-agnostic approach to structured reasoning generation in LVLMs.
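To make the inference-time loop concrete, the sketch below illustrates the generate-reason-refine cycle summarized above: GEN proposes a question-adaptive CoT structure, the frozen LVLM reasons along it in a single pass, and attention-derived stage importance optionally triggers another refinement round. All names and interfaces here (CoTStructure, Generator.propose, Generator.refine, run_vlm, the 0.1 importance threshold) are hypothetical placeholders for illustration, not the paper's actual API.

```python
# Minimal sketch of ReaGEN's inference loop (hypothetical interfaces).
from dataclasses import dataclass


@dataclass
class CoTStructure:
    stages: list[str]  # ordered reasoning stages, e.g. ["perceive", "reason"]


class Generator:
    """Stand-in for the compact GEN model (hypothetical API)."""

    def propose(self, question: str) -> CoTStructure:
        # Produce an initial question-adaptive structure (placeholder output).
        return CoTStructure(stages=["perceive", "ground", "reason", "answer"])

    def refine(self, structure: CoTStructure, feedback: list[float]) -> CoTStructure:
        # Illustrative rule: drop stages whose attention-derived importance is low.
        kept = [s for s, w in zip(structure.stages, feedback) if w >= 0.1]
        return CoTStructure(stages=kept or structure.stages)


def run_vlm(question: str, structure: CoTStructure) -> tuple[str, list[float]]:
    """Stand-in for one single-path LVLM pass; returns an answer and
    per-stage attention importance scores (placeholder values)."""
    return "answer", [0.4, 0.05, 0.35, 0.2]


def reagen_infer(question: str, gen: Generator, max_rounds: int = 2) -> str:
    structure = gen.propose(question)
    answer, feedback = run_vlm(question, structure)
    for _ in range(max_rounds - 1):
        if min(feedback) >= 0.1:  # every stage looks useful: stop early
            break
        structure = gen.refine(structure, feedback)
        answer, feedback = run_vlm(question, structure)
    return answer


print(reagen_infer("How many red objects are left of the cup?", Generator()))
```

Because refinement reuses a single forward pass per round rather than branching into multiple candidate paths, this loop retains single-path token cost while still adapting the structure to the model's internal feedback.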