RealBirdID: Benchmarking Bird Species Identification in the Era of MLLMs
Logan Lawrence ⋅ Oindrila Saha ⋅ Rangel Daroya ⋅ Mustafa Chasmai ⋅ Wuao Liu ⋅ Max Hamilton ⋅ Aaron Sun ⋅ Seoyun Jeong ⋅ Fabien Delattre ⋅ Subhransu Maji ⋅ Grant Horn
Abstract
Fine-grained bird species identification in the wild is frequently unanswerable from a single image: key cues may be non-visual (vocalization, range, season), or obscured due to occlusion, camera angle, or low resolution. Yet today’s multimodal systems are typically judged on answerable, in-schema cases, encouraging confident guesses rather than principled abstention. We propose the RealBirdID benchmark: given an image of a bird, a system should either answer with a species or abstain with a concrete, evidence-based rationale (e.g., “requires vocalization,” “out of range,” “view obstructed”). For each genus, the dataset includes a validation split composed of curated unanswerable examples with labeled rationales, paired with a companion set of clearly answerable instances. We find that (1) species identification on the answerable set is challenging for a variety of open-source and proprietary models ($\leq 17\%$ accuracy, including GPT-5 and Gemini-2.5 Pro), (2) models with greater classification ability are not necessarily better calibrated to abstain on unanswerable examples, and (3) MLLMs generally fail to provide correct reasons even when they do abstain. RealBirdID establishes a focused target for abstention-aware fine-grained recognition and a recipe for measuring progress.
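The answer-or-abstain protocol described above suggests three quantities to track: accuracy on answerable images, how often a system abstains on unanswerable images, and how often its stated reason matches the labeled rationale. The following is a minimal sketch of such scoring, not the benchmark's official evaluation code. The `Example` and `Prediction` records, the `score` function, the metric names, and the exact-match comparison of rationales are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    # Hypothetical gold record: exactly one of the two fields is set.
    species: Optional[str]    # gold species, or None if unanswerable
    rationale: Optional[str]  # gold abstention rationale, or None if answerable


@dataclass
class Prediction:
    # Hypothetical system output.
    species: Optional[str]    # predicted species, or None if the system abstained
    rationale: Optional[str]  # stated reason for abstaining, if any


def score(examples, predictions):
    """Compute three illustrative metrics (assumed, not from the paper)."""
    pairs = list(zip(examples, predictions))
    answerable = [(e, p) for e, p in pairs if e.species is not None]
    unanswerable = [(e, p) for e, p in pairs if e.species is None]

    # An abstention on an answerable image counts as incorrect here.
    correct = sum(p.species == e.species for e, p in answerable)

    # Abstentions on unanswerable images, and those with a matching reason.
    abstained = [(e, p) for e, p in unanswerable if p.species is None]
    good_reason = sum(p.rationale == e.rationale for e, p in abstained)

    return {
        "answerable_accuracy": correct / len(answerable) if answerable else 0.0,
        "abstention_rate": len(abstained) / len(unanswerable) if unanswerable else 0.0,
        "rationale_accuracy": good_reason / len(abstained) if abstained else 0.0,
    }


# Toy data with made-up labels, for illustration only.
examples = [
    Example("species_a", None),
    Example("species_b", None),
    Example(None, "requires vocalization"),
    Example(None, "view obstructed"),
]
predictions = [
    Prediction("species_a", None),              # correct answer
    Prediction("species_c", None),              # wrong answer
    Prediction(None, "requires vocalization"),  # abstains with the labeled reason
    Prediction("species_a", None),              # guesses instead of abstaining
]
result = score(examples, predictions)
```

Reporting the three numbers separately, rather than as one combined score, keeps the distinction the abstract draws between classification ability, willingness to abstain, and the correctness of the stated reason.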