Rounded or Streamlined Head? Bridging Concept Bottleneck Models and Attribute-Described Object Parts
Abstract
A faithful decision-making process requires models to ground human-understandable concepts both spatially (where they appear in the image) and causally (how they influence the prediction). Recent advances in Vision–Language Models (VLMs) enable concept-level alignment between images and text and have inspired Concept Bottleneck Models (CBMs), which explain predictions by mapping image representations to human-understandable concepts, allowing users to trace decisions through explicit semantic reasoning. However, existing CBMs suffer from two key inconsistencies. First, semantic inconsistency: VLMs often fail to localize fine-grained part–attribute concepts, producing noisy or incomplete masks. Second, object inconsistency: object-agnostic concepts such as "head: streamlined front profile" may describe multiple categories (e.g., a fish or a human); without enforcing object identity, non-targeted regions can introduce spurious evidence that corrupts the bottleneck representation. To address these challenges, we propose the Object-Aware Concept Bottleneck Model (OA-CBM), which jointly enforces semantic- and object-level consistency. Specifically, (1) we redefine concepts as part–attribute pairs to enhance VLM robustness at the semantic level, and (2) we introduce class-agnostic object clustering to suppress irrelevant visual evidence. We further annotate two grounding datasets with part–attribute descriptions and conduct extensive experiments. Results demonstrate that OA-CBM produces more faithful and robust explanations while maintaining competitive predictive performance.