Learning to Select Visual Tools from Experience
Abstract
We introduce VisualToolAgent (VisTA), a new reinforcement learning framework that empowers visual agents to dynamically explore, select, and compose tools from a diverse library based on empirical performance. Existing methods for tool-augmented visual reasoning either rely on training-free prompting or on large-scale supervised fine-tuning; both lack active tool exploration and typically assume limited tool diversity, and fine-tuning methods additionally demand extensive human supervision. In contrast, VisTA leverages end-to-end reinforcement learning to iteratively refine sophisticated, query-specific tool-selection strategies, guided solely by task outcomes. Leveraging reinforcement learning with verifiable rewards (RLVR), our framework enables an agent to autonomously discover effective tool-selection pathways without requiring explicit reasoning supervision. Experiments on the ChartQA, Geometry3K, MathVerse, and BlindTest benchmarks demonstrate that VisTA achieves significant performance gains over training-free and fine-tuning baselines, especially on out-of-distribution examples. These results highlight VisTA's ability to enhance generalization and adaptively utilize diverse tools, paving the way for flexible, experience-driven visual reasoning systems.
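To make the RLVR idea concrete, the following is a minimal toy sketch (not the paper's actual method or models): a softmax policy over candidate tools is trained with REINFORCE, where the only learning signal is a verifiable binary reward indicating whether the final answer checked out. The tool names and the simulated success rates in `verify` are invented for illustration.

```python
import math
import random

def softmax(prefs):
    # Convert preference scores into a probability distribution over tools.
    m = max(prefs.values())
    exps = {t: math.exp(p - m) for t, p in prefs.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def sample(probs, rng):
    # Draw one tool according to the policy's probabilities.
    r, acc = rng.random(), 0.0
    for tool, p in probs.items():
        acc += p
        if r <= acc:
            return tool
    return tool

def train_tool_selector(tools, verify, steps=3000, lr=0.1, seed=0):
    """REINFORCE on a softmax policy over tools, guided only by a
    verifiable 0/1 outcome reward (no reasoning supervision)."""
    rng = random.Random(seed)
    prefs = {t: 0.0 for t in tools}
    avg_r = 0.0  # running reward baseline to reduce gradient variance
    for i in range(1, steps + 1):
        probs = softmax(prefs)
        tool = sample(probs, rng)
        r = verify(tool, rng)            # 1.0 if the answer verifies, else 0.0
        avg_r += (r - avg_r) / i
        adv = r - avg_r
        for t in tools:
            # Policy-gradient update for a softmax bandit policy.
            grad = (1.0 if t == tool else 0.0) - probs[t]
            prefs[t] += lr * adv * grad
    return prefs

def verify(tool, rng):
    # Hypothetical per-tool success rates standing in for task outcomes.
    success = {"chart_parser": 0.8, "ocr": 0.3, "caption": 0.2}
    return 1.0 if rng.random() < success[tool] else 0.0

prefs = train_tool_selector(["chart_parser", "ocr", "caption"], verify)
```

After training, the policy concentrates its probability mass on the tool whose use most often yields a verified answer, mirroring how outcome-only rewards can shape tool selection without step-level supervision.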