RMIR: A Benchmark Dataset for Reasoning-Intensive Multimodal Image Retrieval
Yijiang Li ⋅ Kunal Kotian ⋅ Ali Marjaninejad ⋅ Meir Friedenberg ⋅ Kaushik Pavani ⋅ Sunny Dasgupta
Abstract
Current multimodal image retrieval benchmarks focus on relatively simple queries in which the target image is either described directly or specified by simple composition with an input image. When retrieval requires complex reasoning to determine the target image, the task becomes significantly more challenging, yet no standardized benchmark exists for this setting. To fill this gap, we introduce RMIR, a benchmark dataset of $1,634$ queries requiring reasoning across three categories: functional (object affordances), temporal (time-based relationships), and causal (cause-effect reasoning). Each query combines visual and textual inputs and demands robust visual understanding coupled with logical inference, beyond surface-level matching, to identify the correct target images. Evaluating state-of-the-art models on RMIR reveals significant performance gaps: the best model achieves only $46.53\%$ recall@$20$ averaged across the reasoning categories. Our systematic analysis exposes fundamental limitations of current multimodal retrieval systems and establishes RMIR as a challenging testbed for developing reasoning-capable multimodal retrieval models.
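Recall@$k$ here presumably follows the standard single-target formulation (the abstract does not define it, so this is an assumption): the fraction of queries whose target image appears among the top $k$ retrieved results,
$$\mathrm{recall@}k \;=\; \frac{1}{|Q|} \sum_{q \in Q} \mathbf{1}\!\left[\, t_q \in \mathrm{top}_k(q) \,\right],$$
where $Q$ is the query set, $t_q$ is the target image for query $q$, and $\mathrm{top}_k(q)$ is the set of $k$ highest-ranked images returned for $q$.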