EduDiag: A Benchmark for Educational Diagnostic Reasoning with Error Tracing and Correction on Large Multimodal Models
Abstract
Large multimodal models (LMMs) have achieved impressive performance on multimodal reasoning, becoming a crucial technology for the advancement of intelligent question-answering systems. In real-world educational scenarios, however, effective teaching extends far beyond providing answers: experienced teachers analyze students' incorrect answers to trace the underlying errors and provide corrective feedback, a process we term educational diagnostic reasoning, which remains under-explored in existing LMMs. To bridge this gap, we introduce the \textit{EduDiag} benchmark, which requires LMMs to reconstruct erroneous reasoning chains from incorrect answers and to generate corrective feedback. Through an AI-assisted annotation pipeline with rigorous human verification, we create 8K erroneous reasoning chains and corresponding feedback, spanning three representative educational domains: commonsense, science, and mathematics. Extensive evaluation of 28 leading LMMs shows that \textit{EduDiag} is a challenging testbed: even leading proprietary LMMs struggle on it, and supervised fine-tuning (SFT) of open-source LMMs yields only marginal gains. Moreover, our analysis experiments identify three critical insights for educational diagnostic reasoning: (i) effective error tracing remains the primary bottleneck, as SFT models still fail to trace back commonly occurring errors from incorrect answers; (ii) group relative policy optimization (GRPO) mitigates this bottleneck and boosts performance; and (iii) LMMs optimized with GRPO can generate plausible yet challenging distractors for multiple-choice questions based on their self-constructed erroneous reasoning chains. We believe \textit{EduDiag} provides a new direction for evaluating advanced LMMs.