ChartR: Evaluating Reasoning Accuracy and Robustness in Chart Question Answering
Abstract
Chart Question Answering (CQA) benchmarks are critical for evaluating Multimodal Large Language Models (MLLMs) on reasoning over visual data. Existing benchmarks focus mainly on final-answer correctness, ignoring intermediate reasoning steps and the propagation of errors in multi-step processes. To address this, we introduce \textbf{ChartR}, a benchmark designed to assess both the accuracy and robustness of reasoning in chart-understanding tasks. Each question is decomposed into 4–10 sub-questions covering key reasoning types, and each chart is paired with four visually perturbed variants (blurred, noise-added, watermark-added, annotation-removed) to systematically evaluate robustness. ChartR contains 200 base charts, 800 variants, 1,652 questions, and 8,260 image–question pairs. We further propose a comprehensive evaluation framework with eight metrics that assess reasoning-chain accuracy and robustness under visual perturbations, and that enable analysis of error-propagation patterns. Experiments on twelve MLLMs, including general-purpose and chart-specialized models, reveal low reasoning reliability, early-step errors that propagate through subsequent steps, value extraction as the primary bottleneck, and sharp performance drops under perturbations, indicating reliance on textual cues rather than genuine visual understanding.