Oral Session 1B: Interpretability and Evaluation
OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
Pengfei Zhou · Xiaopeng Peng · Jiajun Song · Chuanhao Li · Zhaopan Xu · Yue Yang · Ziyao Guo · Hao Zhang · Yuqi Lin · Yefei He · Lirui Zhao · Shuo Liu · Tianhua Li · Yuxuan Xie · Xiaojun Chang · Yu Qiao · Wenqi Shao · Kaipeng Zhang
Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding and generation tasks. However, generating interleaved image-text content remains a challenge, as it requires integrated multimodal understanding and generation abilities. While progress on unified models offers new solutions, existing benchmarks are insufficient for evaluating these methods due to limitations in data size and diversity. To bridge this gap, we introduce OpenING, a comprehensive benchmark comprising 5,400 high-quality, human-annotated instances across 56 real-world tasks. OpenING covers diverse daily scenarios such as travel guides, design, and brainstorming, offering a robust platform for challenging interleaved generation methods. In addition, we present IntJudge, a judge model for evaluating open-ended multimodal generation. Trained with a novel data pipeline, IntJudge achieves an agreement rate of 82.42% with human judgments, outperforming GPT-based evaluators by 11.34%. Extensive experiments on OpenING reveal that current interleaved generation methods still have substantial room for improvement. Key findings on interleaved image-text generation are further presented to guide the development of next-generation models. The benchmark, code, and judge models will be released.
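A minimal sketch of how an agreement rate between an automatic judge and human annotators could be computed for pairwise comparisons of generated outputs. The field names and tie handling below are illustrative assumptions, not the OpenING/IntJudge protocol itself.

```python
# Hedged sketch: agreement rate between a judge model and human annotators on
# pairwise comparisons. Field names ("human_winner", "judge_winner") and the
# verdict labels are assumptions for illustration only.

def agreement_rate(judgments):
    """judgments: list of dicts with 'human_winner' and 'judge_winner',
    each one of 'A', 'B', or 'tie' for a pair of model outputs."""
    if not judgments:
        return 0.0
    matches = sum(1 for j in judgments if j["judge_winner"] == j["human_winner"])
    return matches / len(judgments)

# Example: three pairwise comparisons, two agreements -> 66.67%
sample = [
    {"human_winner": "A", "judge_winner": "A"},
    {"human_winner": "B", "judge_winner": "A"},
    {"human_winner": "tie", "judge_winner": "tie"},
]
print(f"Agreement: {agreement_rate(sample):.2%}")
```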
LibraGrad: Balancing Gradient Flow for Universally Better Vision Transformer Attributions
Faridoun Mehri · Mahdieh Baghshah · Mohammad Taher Pilehvar
Why do gradient-based explanations struggle with Transformers, and how can we improve them? We identify gradient flow imbalances in Transformers that violate FullGrad-completeness, a critical property for attribution faithfulness that CNNs naturally possess. To address this issue, we introduce LibraGrad—a theoretically grounded post-hoc approach that corrects gradient imbalances through pruning and scaling of backward paths, without changing the forward pass or adding computational overhead. We evaluate LibraGrad using three metric families: Faithfulness, which quantifies prediction changes under perturbations of the most and least relevant features; Completeness Error, which measures attribution conservation relative to model outputs; and Segmentation AP, which assesses alignment with human perception. Extensive experiments across 8 architectures, 4 model sizes, and 4 datasets show that LibraGrad universally enhances gradient-based methods, outperforming existing white-box methods—including Transformer-specific approaches—across all metrics. We demonstrate superior qualitative results through two complementary evaluations: precise text-prompted region highlighting on CLIP models and accurate class discrimination between co-occurring animals on ImageNet-finetuned models—two settings on which existing methods often struggle. LibraGrad is effective even on the attention-free MLP-Mixer architecture, indicating potential for extension to other modern architectures. Our code is freely available.
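A minimal sketch of the general mechanism the abstract describes: modifying how gradients flow backward through chosen modules without touching the forward pass. The uniform scaling rule below is an illustrative stand-in, not LibraGrad's actual pruning and scaling scheme.

```python
# Hedged sketch: rescaling gradients along a backward path with PyTorch hooks,
# leaving the forward pass unchanged. The single scaling constant is an
# illustrative assumption, not the LibraGrad rule.
import torch
import torch.nn as nn

def scale_backward(module: nn.Module, factor: float):
    """Multiply the gradients flowing into `module`'s inputs by `factor`."""
    def hook(mod, grad_input, grad_output):
        return tuple(g * factor if g is not None else None for g in grad_input)
    return module.register_full_backward_hook(hook)

# Toy example: damp the gradient contribution passing through one nonlinearity.
model = nn.Sequential(nn.Linear(8, 8), nn.GELU(), nn.Linear(8, 1))
handle = scale_backward(model[1], factor=0.5)   # rescale grads through the GELU

x = torch.randn(4, 8, requires_grad=True)
model(x).sum().backward()
attribution = (x.grad * x).abs()                # simple gradient-times-input map
handle.remove()                                 # restore ordinary backprop
```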
Do We Always Need the Simplicity Bias? Looking for Optimal Inductive Biases in the Wild
Damien Teney · Liangze Jiang · Florin Gogianu · Ehsan Abbasnejad
Common choices of architecture give neural networks a preference for fitting data with simple functions. This simplicity bias is widely regarded as key to their success. This paper explores the limits of this assumption. Building on recent work showing that activation functions are the origin of the simplicity bias (Teney, 2024), we introduce a method to meta-learn activation functions to modulate this bias. Findings. We discover multiple tasks where the assumption of simplicity is inadequate and standard ReLU architectures are therefore suboptimal. In these cases, we find activation functions that perform better by inducing a prior of higher complexity. Interestingly, these cases correspond to domains where neural networks have historically struggled: tabular data, regression tasks, cases of shortcut learning, and algorithmic grokking tasks. In comparison, the simplicity bias proves adequate on image tasks, where learned activations are nearly identical to ReLUs and GELUs. Implications. (1) Contrary to common belief, the simplicity bias is not universally useful; there exist real tasks where it is suboptimal. (2) The suitability of ReLU models for image classification is not accidental. (3) The success of ML ultimately depends on the adequacy between data and architectures, and there may be benefits to architectures tailored to specific distributions of tasks.
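A minimal sketch of what a learnable activation function can look like: a few coefficients over fixed basis functions that control the shape of the nonlinearity, and hence the network's inductive bias. The basis choice and the plain-gradient training are illustrative assumptions; the paper meta-learns such parameters across tasks.

```python
# Hedged sketch: an activation whose shape is governed by learnable
# coefficients over fixed bases. Initialized near ReLU; the extra bases can
# shift the inductive bias toward higher-complexity functions.
import torch
import torch.nn as nn

class ParametricActivation(nn.Module):
    def __init__(self):
        super().__init__()
        # Start as (approximately) ReLU: weight 1 on relu, 0 on the other bases.
        self.coeffs = nn.Parameter(torch.tensor([1.0, 0.0, 0.0]))

    def forward(self, x):
        basis = torch.stack([torch.relu(x), torch.sin(x), x * torch.sin(x)], dim=-1)
        return (basis * self.coeffs).sum(dim=-1)

# Drop-in replacement for ReLU; the coefficients could be optimized in an
# outer (meta) loop while an inner loop fits each task.
net = nn.Sequential(nn.Linear(16, 64), ParametricActivation(), nn.Linear(64, 1))
out = net(torch.randn(8, 16))
```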
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Matt Deitke · Christopher Clark · Sangho Lee · Rohun Tripathi · Yue Yang · Jae Sung Park · Reza Salehi · Niklas Muennighoff · Kyle Lo · Luca Soldaini · Jiasen Lu · Taira Anderson · Erin Bransom · Kiana Ehsani · Huong Ngo · Yen-Sung Chen · Ajay Patel · Mark Yatskar · Chris Callison-Burch · Andrew Head · Rose Hendrix · Favyen Bastani · Eli VanderBilt · Nathan Lambert · Yvonne Chou · Arnavi Chheda-Kothary · Jenna Sparks · Sam Skjonsberg · Michael Schmitz · Aaron Sarnat · Byron Bischoff · Pete Walsh · Christopher Newell · Piper Wolters · Tanmay Gupta · Kuo-Hao Zeng · Jon Borchardt · Dirk Groeneveld · Crystal Nam · Sophie Lebrecht · Caitlin Wittlif · Carissa Schoenick · Oscar Michel · Ranjay Krishna · Luca Weihs · Noah A. Smith · Hannaneh Hajishirzi · Ross Girshick · Ali Farhadi · Aniruddha Kembhavi
Today's most advanced vision-language models (VLMs) remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed VLMs into open ones. As a result, the community has been missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key contribution is a collection of new datasets, including a dataset of highly detailed image captions for pre-training called PixMo, a free-form image Q&A dataset for fine-tuning, and an innovative 2D pointing dataset, all collected without the use of external VLMs. The success of our approach relies on careful modeling choices, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets. Our best-in-class 72B model not only outperforms others in the class of open-weight and open-data models but also outperforms larger proprietary models, including Claude 3.5 Sonnet and Gemini 1.5 Pro and Flash, placing second only to GPT-4o on both academic benchmarks and a large human evaluation. Our model weights, new datasets, and source code will all be released.
Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector
Xiao Guo · Xiufeng Song · Yue Zhang · Xiaohong Liu · Xiaoming Liu
Deepfake detection is a long-established research topic crucial for combating the spread of malicious misinformation. Unlike previous methods that provide either binary classification results or textual explanations for deepfake detection, we propose a novel method that delivers both simultaneously. Our method harnesses the multi-modal learning power of the pre-trained CLIP and the interpretability of large language models (LLMs) to enhance both the generalization and interpretability of deepfake detection. Specifically, we introduce a multi-modal face forgery detector (M2F2-Det) that employs specially designed face forgery prompt learning, integrating the zero-shot capabilities of the pre-trained CLIP to improve generalization to unseen forgeries. M2F2-Det also incorporates an LLM to provide detailed explanations for its detection decisions, offering strong interpretability by bridging the gap between natural language and the subtle nuances of facial forgery. Empirically, we evaluate M2F2-Det on both detection and sentence generation tasks, achieving state-of-the-art performance on both and demonstrating its effectiveness in detecting and explaining diverse and unseen forgeries. Code and models will be released upon publication.
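A minimal sketch of the kind of prompt-based CLIP cue the abstract builds on: zero-shot scoring of a face image against text prompts describing real versus forged faces. The prompt wording and checkpoint are illustrative assumptions; M2F2-Det's learned forgery prompts and LLM-generated explanations are not reproduced here.

```python
# Hedged sketch: zero-shot forgery scoring with CLIP via text prompts.
# Prompts, checkpoint, and input path are assumptions for illustration.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a photo of a real human face",
           "a photo of a digitally forged human face"]
image = Image.open("face.jpg")  # hypothetical input image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(f"P(forged) = {probs[0, 1].item():.3f}")
```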