Will Multimodal Models Be Dazzled by Multi-Image Visual Puzzles?
Zhi Zhu ⋅ YaoQi Fan ⋅ Zhe Chen ⋅ Yue Cao ⋅ Yangzhou Liu ⋅ Tong Lu
Abstract
The rapid advancement of Multimodal Large Language Models (MLLMs) has revealed the limitations of existing benchmarks in evaluating complex reasoning over multiple images. To address this gap, we introduce $\textbf{MIRACLE}$, a novel benchmark for Multi-Image complex Reasoning And Comprehension Logic Evaluation, featuring 4,000 questions across diverse reasoning types such as visual comparison, temporal sequencing, and spatial relations, with each question involving an average of seven tightly correlated images. MIRACLE enforces strong inter-image dependencies through a systematic data collection process, followed by careful instance grouping and question design that require cross-image reasoning. Evaluation of leading MLLMs shows that even top-performing models like Gemini-2.5-Pro score only 55.91\%, highlighting the significant challenges of multi-image reasoning. Moreover, in scenarios with high visual information density, such as puzzle tasks and inputs with very many images, all models exhibit a significant drop in performance. This exposes the limitations of MLLMs in handling complex structural relations and collaborative reasoning, and reveals deficiencies in their cognitive capabilities under high-load visual reasoning settings. We hope MIRACLE will inspire the community to push the boundaries of multi-image reasoning. The benchmark will be released.