VisionLeaf: Entropy-Guided Leaf-First Reasoning for Efficient and Accurate Think-with-Image
Abstract
The "think-with-image" paradigm has recently gained traction for complex visual reasoning tasks. However, existing approaches often suffer from inefficient inference, because they run a fixed number of reasoning steps, many of them redundant, and from unstable training. These challenges arise primarily from applying standard reinforcement learning policies directly, without adapting them to the multi-turn conversational setting of think-with-image. To address them, we propose VisionLeaf, an entropy-guided, tree-based reasoning framework. Unlike conventional GRPO, where all nodes expand from the root and each leaf has only a single branch, our method grows the reasoning tree from the leaf nodes and selects the most valuable nodes based on entropy for thorough rollout exploration. This leaf-first expansion naturally aligns with the hierarchical nature of multi-step image analysis. Without modifying the model or the training data, VisionLeaf achieves a 4.2\% performance improvement on benchmarks such as VSTAR and HRBench while reducing the number of inference rounds by nearly half, demonstrating significant gains in both accuracy and speed. All our code will be released.
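To make the leaf-first idea concrete, the following is a minimal illustrative sketch, not the released implementation. It assumes that each reasoning step exposes a next-token probability distribution, that a node's "value" is the Shannon entropy of that distribution, and that the highest-entropy leaves are the ones chosen for further rollouts. The abstract does not specify these details, so the names Node, entropy, leaves, select_leaves, and expand are all hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One reasoning step in the rollout tree (hypothetical structure)."""
    name: str
    token_probs: list  # assumed: next-token distribution observed at this step
    children: list = field(default_factory=list)


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def leaves(node):
    """Collect every current leaf of the tree rooted at `node`."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def select_leaves(root, k):
    """Return the k leaves with the highest entropy (assumed selection rule)."""
    ranked = sorted(leaves(root), key=lambda n: entropy(n.token_probs), reverse=True)
    return ranked[:k]


def expand(leaf, rollouts):
    """Attach new rollouts below a leaf, so the tree grows from leaves rather than the root."""
    leaf.children.extend(rollouts)
    return leaf


# Build a small tree: the root has three leaf children with different uncertainty.
root = Node("root", [1.0])
root.children = [
    Node("a", [0.5, 0.5]),   # most uncertain step
    Node("b", [0.9, 0.1]),
    Node("c", [1.0]),        # fully confident step
]

# Select the most uncertain leaf and branch from it with two placeholder rollouts.
chosen = select_leaves(root, k=1)[0]
expand(chosen, [Node("a1", [0.7, 0.3]), Node("a2", [0.6, 0.4])])
```

In this sketch, branching happens below the selected leaf instead of restarting every rollout from the root, which is one plausible reading of how leaf-first expansion could concentrate exploration on uncertain steps.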