Poster
Hierarchy-Aware Evaluation of Free-Form Predictions From Vision-And-Language Models
Vésteinn Snæbjarnarson · Kevin Du · Niklas Stoehr · Serge Belongie · Ryan Cotterell · Nico Lang · Stella Frank
When a vision-and-language model (VLM) is prompted to identify an entity in an image, it may err on the side of caution and answer with "tree" instead of a more specific description such as "pine tree". Traditional binary accuracy metrics cannot differentiate between wrong predictions and insufficiently specific ones. Nor do they give partial credit for close answers: "pine tree" for a Norway spruce is, taxonomically speaking, better than "cypress", but string matching-based similarity measures reject both equally. To address this shortcoming, we propose a framework for evaluating open-ended text predictions against a taxonomic hierarchy, using hierarchical precision and recall to quantify the correctness and specificity of predictions. We first show that existing text similarity measures and accuracy-based evaluation metrics do not capture taxonomic similarity well. We then develop and compare different methods for mapping free-form VLM predictions onto a taxonomy, which allows us to compute hierarchical similarity measures between the model outputs and the ground-truth labels. Finally, we use this taxonomic evaluation to analyze modern VLMs on fine-grained visual classification tasks. We find that models respond differently to instructions prompting for more specific answers, with GPT-4V responding most specifically and others showing a trade-off between hierarchical precision and recall.
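The abstract does not spell out how hierarchical precision and recall are computed; the sketch below is a minimal illustration assuming the standard ancestor-overlap definitions (shared ancestors divided by the predicted and gold ancestor sets, respectively) and a small made-up taxonomy. The taxonomy, node names, and function names are illustrative assumptions, not the paper's implementation.

    # Toy taxonomy: child -> parent (None marks the root). This hierarchy is
    # invented for illustration and is not the taxonomy used in the paper.
    PARENT = {
        "plant": None,
        "tree": "plant",
        "conifer": "tree",
        "pinaceae": "conifer",
        "cupressaceae": "conifer",
        "pine": "pinaceae",
        "norway spruce": "pinaceae",
        "cypress": "cupressaceae",
    }

    def ancestors(node: str) -> set[str]:
        """Return the node together with all of its ancestors in the taxonomy."""
        chain = set()
        while node is not None:
            chain.add(node)
            node = PARENT[node]
        return chain

    def hierarchical_precision_recall(pred: str, gold: str) -> tuple[float, float]:
        """Ancestor-overlap hierarchical precision and recall (assumed definition)."""
        p, g = ancestors(pred), ancestors(gold)
        overlap = len(p & g)
        return overlap / len(p), overlap / len(g)

    if __name__ == "__main__":
        # "pine" for a Norway spruce gets partial credit: (0.8, 0.8)
        print(hierarchical_precision_recall("pine", "norway spruce"))
        # "cypress" shares fewer ancestors with the gold label: (0.6, 0.6)
        print(hierarchical_precision_recall("cypress", "norway spruce"))
        # The under-specific "tree" is precise but not specific: (1.0, 0.4)
        print(hierarchical_precision_recall("tree", "norway spruce"))

Under these assumed definitions, an under-specific answer like "tree" keeps perfect hierarchical precision but loses recall, while a wrong-but-close answer like "pine" receives partial credit, which is the behavior the abstract argues binary accuracy and string matching cannot capture.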