RegFormer: Transferable Relational Grounding for Efficient Weakly-Supervised Human-Object Interaction Detection
Abstract
Weakly supervised Human–Object Interaction (HOI) detection is vital for scalable scene understanding, as it learns interactions from only image-level annotations, i.e., without labels specifying which human–object instances are engaged in an interaction. Due to the lack of localization signals, prior works typically propose candidate pairs using an external object detector and then infer their interactions through pairwise reasoning. However, this framework often struggles to scale due to the substantial computational cost of enumerating numerous instance pairs. It also exhibits suboptimal performance because false positives arising from non-interactive combinations hinder instance-level HOI reasoning. To this end, we introduce the Relational Grounding Transformer (RegFormer), a versatile interaction recognition module that enables efficient and accurate HOI reasoning. Under image-level supervision, RegFormer leverages spatially grounded implicit signals to guide the reasoning process, facilitating effective locality elicitation. Benefiting from the implicitly learned local interactions, our module can accurately distinguish humans, objects, and their interactions within their corresponding regions, enabling precise and efficient instance-level HOI reasoning without any additional training. Extensive experiments and analysis demonstrate that RegFormer effectively learns spatial cues for instance-level interaction reasoning, operates with high efficiency, and achieves performance comparable to fully supervised models.