ChartR: Evaluating Reasoning Accuracy and Robustness in Chart Question Answering
Abstract
Chart Question Answering (CQA) benchmarks are critical for evaluating Multimodal Large Language Models (MLLMs) on reasoning over visual data. Existing benchmarks focus mainly on final-answer correctness, ignoring intermediate reasoning steps and the propagation of errors in multi-step processes. To address this, we introduce \textbf{ChartR}, a benchmark designed to assess both the accuracy and robustness of reasoning in chart-understanding tasks. Each question is decomposed into 4–10 sub-questions covering key reasoning types, and each chart is paired with four visually perturbed variants (blurred, noise-added, watermark-added, annotation-removed) to systematically evaluate robustness. ChartR contains 200 base charts, 800 variants, 1,652 questions, and 8,260 image–question pairs. We further propose a comprehensive evaluation framework with eight metrics that assess reasoning-chain accuracy and robustness under visual perturbations, and that enable analysis of error-propagation patterns. Experiments on twelve MLLMs, including general-purpose and chart-specialized models, reveal low reasoning reliability, early-step errors that propagate through subsequent steps, value extraction as the primary bottleneck, and sharp performance drops under perturbations, indicating reliance on textual cues rather than genuine visual understanding.