DiGraphHal-Bench: Evaluating Multimodal Large Language Models on Complex Directed Graphs
Abstract
While prior research on Multimodal Large Language Model (MLLM) hallucinations has primarily examined cross-modal inconsistencies in natural images, hallucination over complex graph structures remains underexplored. Concurrently, there is a lack of robust evaluation for fine-grained reasoning that integrates structural, visual, and semantic information. To address these gaps, we present DiGraphHal-Bench, the first large-scale Visual Question Answering (VQA) benchmark for evaluating both hallucination phenomena and fine-grained reasoning of MLLMs on real-world directed graphs. DiGraphHal-Bench comprises high-quality procedural graphs from more than six distinct domains and is organized around a taxonomy of four high-level capabilities and twelve fine-grained tasks. To ensure benchmark fidelity, we propose a novel two-stage automatic data curation pipeline that reconciles the trade-off between data scale and quality, thereby guaranteeing reliable evaluation. Experiments reveal that state-of-the-art MLLMs hallucinate substantially in fine-grained graph reasoning. Although supervised fine-tuning (SFT) markedly mitigates these hallucinations and strengthens complex reasoning, performance remains far from optimal. Ablation studies highlight the importance of fundamental capabilities for integrative reasoning, and our benchmark provides a foundation for advancing robust multimodal graph understanding.