HOPS: Hierarchical Open-vocabulary Part Segmentation with Attention-Aware Filtering and Affinity-Guided Enhancement
Abstract
Open-vocabulary part segmentation (OVPS) aims to segment objects into fine-grained parts while generalizing to unseen categories. Existing methods built on vision-language models (VLMs) face two challenges: (1) object over-segmentation, caused by overly broad semantic activations, and (2) part under-segmentation, caused by weak fine-grained perception. To address these issues, we propose HOPS, a two-stage framework for hierarchical open-vocabulary part segmentation. HOPS introduces a bidirectional semantic–structural attention fusion mechanism that integrates CLIP’s semantic alignment with DINO’s structural perception. In the object segmentation stage, the Attention-Aware Filtering Module (AFM) refines cross-modal similarity maps via semantic–structural attention to suppress object over-segmentation. In the part segmentation stage, the Affinity-Guided Enhancement Module (AEM) iteratively propagates part responses along pixel affinities to progressively expand activation regions, mitigating part under-segmentation. Experiments on Pascal-Part-116, ADE20K-Part-234, and PartImageNet demonstrate that HOPS achieves state-of-the-art performance and superior generalization to unseen categories.
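To make the AEM's iterative propagation concrete, the following is a minimal sketch of affinity-guided enhancement in the abstract's sense: a part response map is repeatedly diffused along a row-normalized pixel-affinity matrix so that under-activated pixels belonging to the same part are progressively filled in. The function name, the blending weight `alpha`, and the fixed step count are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def affinity_propagate(response, affinity, steps=3, alpha=0.5):
    """Illustrative sketch (not the paper's exact algorithm):
    iteratively diffuse part activations along pixel affinities
    to expand under-segmented activation regions.

    response : (N,) initial part response over N pixels
    affinity : (N, N) nonnegative pairwise pixel affinities
    """
    # Row-normalize so each pixel's outgoing affinity sums to 1.
    A = affinity / affinity.sum(axis=1, keepdims=True)
    x = response.astype(float)
    for _ in range(steps):
        # Blend the original response with its affinity-smoothed version,
        # so seed activations are preserved while neighbors are filled in.
        x = (1 - alpha) * response + alpha * (A @ x)
    return x
```

With a strong affinity between adjacent pixels, an activation seeded at one pixel spreads to its affine neighbors over the iterations, which is the intuition behind expanding under-segmented part regions.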