Think Visually, Reason Textually: Vision-Language Synergy in Abstract Reasoning
Abstract
Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, an ability that is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning, enabling intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33\% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward generalizable, human-like intelligence in future foundation models.