Learning to Diversify and Focus: A Reinforcement Framework for Open-Vocabulary HOI Detection
Abstract
Open-Vocabulary Human–Object Interaction (OV-HOI) detection aims to recognize novel HOI categories beyond the training set. Existing OV-HOI detection approaches typically leverage CLIP to extract global visual representations and perform cross-attention between learnable queries and global features to localize human–object pairs. However, such one-stage paradigms tend to overfit seen interactions, limiting generalization to unseen categories, while CLIP's coarse spatial awareness further hinders the localization of fine-grained interaction cues. To address these issues, we propose a novel Semantic-Diversified and Interaction-Focused framework (SD-IF), which integrates reinforcement-guided adaptive optimization to jointly enhance semantic generalization and spatial discrimination. Specifically, we introduce a Semantic Diversification (SD) module that applies reinforcement-driven stochastic semantic perturbations and dual-level semantic exploration, expanding the semantic coverage of queries while maintaining visual coherence and encouraging exploration beyond the seen semantic clusters. Furthermore, we design an Interaction Focusing (IF) module that formulates an actor–critic optimization scheme to adaptively refine attention distributions based on detection features and interaction representations, guided by a hybrid reward that combines spatial focusing and semantic consistency. This cooperative learning paradigm enables the model to capture discriminative interaction cues and achieve spatially interpretable reasoning. Extensive experiments on two widely used benchmarks demonstrate that SD-IF achieves state-of-the-art performance, significantly surpassing existing OV-HOI detection methods.