SCIEval: Evaluating and Benchmarking the Faithfulness of Scientific Image Generation and Interpretation with Large Multimodal Models
Abstract
Scientific images often require accurate numerical representations and correct object attributes, making them differ significantly from real-life images. However, existing faithfulness metrics for image generation and interpretation with large multimodal models (LMMs) focus mostly on real-life images, making them ill-suited for evaluating scientific images. To this end, we propose SCIEval, a novel and unified faithfulness metric specifically designed for SCientific Image Evaluations. First, to fully capture faithfulness, we introduce three key aspects: (i) Relevance (R), which measures the overall text-image correspondence; (ii) Accuracy (A), which examines the details of scientific objects; and (iii) Explainability (E), which reveals unfaithful elements in the generated content. We then generate aspect-aware scientific text-image data to train three sub-evaluators (SCIEval-R/A/E). Specifically, to train SCIEval-R and SCIEval-A, we propose a new SciCLIP framework, which improves the scientific image perception of the CLIP text and visual encoders via intra- and cross-modal contrastive learning. Meanwhile, to train SCIEval-E, we fine-tune a strong LMM on supervised rationale samples. Moreover, we present SCIEval-Bench, a human-annotated evaluation benchmark spanning 8 scientific domains, consisting of 3,000 scientific text-to-image samples from 4 LMMs (for image generation) and 3,000 scientific image captioning samples from 4 LMMs (for image interpretation). Experiments on SCIEval-Bench demonstrate that SCIEval is more reliable and correlates better with human ratings than 24 competitors, including GPT-4o.
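To illustrate the cross-modal part of the contrastive training mentioned above, the sketch below shows a generic CLIP-style symmetric InfoNCE objective over paired image and text embeddings. This is a minimal illustration under assumed conventions, not the paper's actual SciCLIP implementation (which additionally uses intra-modal contrastive terms and scientific-domain data); the function name, tensor shapes, and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric image-text InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: (batch_size, dim) embeddings from the visual and
    text encoders; matched pairs share the same row index.
    """
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the positive pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```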