SCIEval: Evaluating and Benchmarking the Faithfulness of Scientific Image Generation and Interpretation with Large Multimodal Models
Abstract
Scientific images often require accurate numerical representations and correct object attributes, making them differ significantly from real-life images. However, existing faithfulness metrics for image generation and interpretation with large multimodal models (LMMs) focus mostly on real-life images, making them ill-suited for evaluating scientific images. To this end, we propose SCIEval, a novel and unified faithfulness metric specifically designed for SCientific Image Evaluations. First, to fully capture faithfulness, we introduce three key aspects: (i) Relevance (R), which measures the overall text-image correspondence; (ii) Accuracy (A), which examines the details of scientific objects; and (iii) Explainability (E), which reveals unfaithful elements in the generated content. We then generate aspect-aware scientific text-image data to train three sub-evaluators (SCIEval-R/A/E). Specifically, to train SCIEval-R and SCIEval-A, we propose a new SciCLIP framework, which improves the scientific image perception of the CLIP text and visual encoders via intra- and cross-modal contrastive learning. Meanwhile, to train SCIEval-E, we fine-tune a strong LMM on supervised rationale samples. Moreover, we present SCIEval-Bench, a human-annotated evaluation benchmark spanning 8 scientific domains, consisting of 3,000 scientific text-to-image samples from 4 LMMs (for image generation) and 3,000 scientific image captioning samples from 4 LMMs (for image interpretation). Experiments on SCIEval-Bench demonstrate that SCIEval is more reliable and correlates better with human ratings than 24 competitors, including GPT-4o.
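To illustrate the cross-modal part of the contrastive training mentioned above, the sketch below shows a generic CLIP-style symmetric InfoNCE objective over paired image and text embeddings. This is a minimal illustration under assumed conventions, not the paper's actual SciCLIP implementation (which additionally uses intra-modal contrastive terms and scientific-domain data); the function name, tensor shapes, and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric image-text InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: (batch_size, dim) embeddings from the visual and
    text encoders; matched pairs share the same row index.
    """
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the positive pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```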