GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models
Abstract
Unified Multimodal Models (UMMs) are redefining the landscape of artificial intelligence by coupling perception and generation across language, vision, and structured reasoning. Yet, despite their growing sophistication, a critical gap persists in evaluation: existing benchmarks largely measure discriminative understanding or unconstrained generation in isolation, overlooking the integrated generative reasoning required for genuine multimodal intelligence. To address this, we introduce GGBench, a benchmark explicitly designed to evaluate geometric generative reasoning: the ability of a model to understand a problem, reason about it, and construct a solution within a unified framework. Each instance in GGBench contains precisely aligned natural-language instructions, executable GeoGebra code, and rendered diagrams, enabling deterministic and interpretable verification of a model's reasoning and constructive fidelity. The benchmark comprises 1,411 rigorously curated problems spanning eight categories and multiple difficulty levels, yielding over 7,000 aligned visualizations. We propose a comprehensive tri-modal evaluation protocol that jointly assesses the quality of textual planning, the executability of generated code, and the geometric accuracy of the resulting diagrams through both automated and human-in-the-loop judging. Extensive experiments on both state-of-the-art UMMs and general Large Language Models (LLMs) reveal a large performance gap between end-to-end diagram generation and reasoning-grounded construction. GGBench establishes a new standard for testing multimodal systems that must not only understand but also build, marking a crucial step toward grounded, verifiable generative intelligence.
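To make the notion of an aligned instance concrete, a hypothetical GGBench-style item might pair the instruction "Construct the circumcircle of triangle ABC" with a short GeoGebra construction such as the sketch below; this is an illustrative example of the instruction-to-code alignment, not an item drawn from the benchmark.

    A = (0, 0)                          # place the three vertices of the triangle
    B = (4, 0)
    C = (1, 3)
    t = Polygon(A, B, C)                # the triangle itself
    bAB = PerpendicularBisector(A, B)   # perpendicular bisectors of two sides
    bBC = PerpendicularBisector(B, C)
    O = Intersect(bAB, bBC)             # circumcenter as their intersection
    c = Circle(O, A)                    # circumcircle through vertex A

Because each construction is executable, correctness can be checked deterministically, for example by rendering the script and verifying that the resulting circle passes through all three vertices.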