A Unified Benchmark for HOI Evaluation across Vision-Language Models and HOI-Specific Methods
Abstract
Human-object interaction (HOI) detection has traditionally been addressed with task-specific models, sometimes augmented by early vision-language models (VLMs) such as CLIP. With the emergence of large generative VLMs, a natural question arises: can standalone VLMs perform HOI detection effectively, and how do they compare with specialized HOI methods? Existing benchmarks such as HICO-DET rely on exact label matching under incomplete annotations, so any prediction that does not match an annotated label is counted as wrong. This penalizes correct but unannotated predictions, especially for VLMs, whose outputs are less constrained, and makes fair comparison between the two paradigms difficult. To address this limitation, we introduce a multiple-choice HOI benchmark with explicitly defined positives and curated negatives, enabling unified and accurate evaluation of both VLMs and HOI-specific models. We further focus on challenging scenarios, such as multi-person scenes and fine-grained interaction distinctions, which are crucial for revealing real differences between the two paradigms. Experiments show that large VLMs achieve competitive, and sometimes superior, zero-shot performance, yet they struggle with multiple concurrent actions and with assigning interactions to the correct person. Conversely, HOI-specific methods remain weaker at general HOI reasoning but demonstrate stronger multi-action recognition and more reliable identification of which person performs which action. These findings expose complementary strengths and weaknesses of VLMs and HOI-specific methods, which existing benchmarks fail to reveal because of incorrect penalization.