Skip to yearly menu bar Skip to main content


Bilateral Adaptation for Human-Object Interaction Detection with Occlusion-Robustness

Guangzhi Wang · Yangyang Guo · Ziwei Xu · Mohan Kankanhalli

Arch 4A-E Poster #367
[ ]
Fri 21 Jun 5 p.m. PDT — 6:30 p.m. PDT


Human-Object Interaction (HOI) Detection constitutes an important aspect of human-centric scene understanding, which requires precise object detection and interaction recognition. Despite increasing advancement in detection, recognizing subtle and intricate interactions remains challenging. Recent methods have endeavored to leverage the rich semantic representation from pre-trained CLIP, yet fail to efficiently capture finer-grained spatial features that are highly informative for interaction discrimination. In this work, instead of solely using representations from CLIP, we fill the gap by proposing a spatial adapter that efficiently utilizes the multi-scale spatial information in the pre-trained detector. This leads to a bilateral adaptation that produces complementary features. Moreover, we design a Conditional Contextual Mining module that further mines informative contextual clues from the spatial features via a tailored cross-attention mechanism. To further improve interaction recognition under occlusion, which is common in crowded scenarios, we propose an Occluded Part Extrapolation module that guides the model to recover the spatial details from manually occluded feature maps. Extensive experiments on V-COCO and HICO-DET benchmarks demonstrate that our method significantly outperforms prior art on both traditional and zero-shot settings, resulting in new state-of-the-art performance. Additional ablation studies further validate the effectiveness of each component in our method.

Live content is unavailable. Log in and register to view live content