PRISM: Prototype-based Reasoning with Inter-modal Semantic Mining for Interpretable Image Recognition
Abstract
Prototype-based methods enhance interpretability in image recognition by learning intermediate part prototypes that serve as the basis of a transparent classifier, enabling reasoning through part-level attention and reference to prototypical examples. However, existing methods typically rely on unimodal visual supervision and confine prototypes to the visual embedding space, which inherently limits their semantic alignment with human-interpretable concepts. In this work, we present PRISM (Prototype-based Reasoning with Inter-modal Semantic Mining), an interpretable image recognition framework that leverages natural language as an auxiliary modality to guide the learning of class-specific part prototypes. PRISM introduces an information-theoretic attribution mechanism that identifies semantically salient image regions conditioned on textual descriptions. By aligning these attribution maps with prototype activation patterns, PRISM implicitly anchors visual part prototypes to conceptually meaningful image regions, improving interpretability without requiring explicit concept modeling. To further sharpen the distinctiveness and localization of prototypes, we introduce a spatial compactness constraint that encourages each prototype to attend to a specific, non-overlapping image region. Extensive experiments on fine-grained benchmarks demonstrate that PRISM not only improves classification performance but also provides faithful, semantically grounded visual explanations.
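The abstract names two auxiliary objectives but gives no formulas: an attribution-alignment term that matches prototype activations to text-conditioned attribution maps, and a spatial compactness term that discourages prototypes from firing on the same locations. The PyTorch sketch below is one plausible rendering under assumed tensor shapes; the function names, the KL-based alignment, and the Gram-matrix overlap penalty are illustrative assumptions, not PRISM's actual losses.

```python
# Hypothetical sketch: the abstract does not specify PRISM's loss formulations,
# so the shapes, names, and formulas below are illustrative assumptions only.
import torch
import torch.nn.functional as F


def alignment_loss(proto_maps: torch.Tensor, attribution: torch.Tensor) -> torch.Tensor:
    """Encourage aggregate prototype activity to match a text-conditioned
    attribution map.

    proto_maps:  (B, K, H, W) per-prototype activation maps.
    attribution: (B, H, W) attribution map derived from textual descriptions.
    """
    # Aggregate over prototypes, then normalize both maps into
    # spatial probability distributions over the H*W locations.
    agg = proto_maps.sum(dim=1).flatten(1)          # (B, H*W)
    att = attribution.flatten(1)                    # (B, H*W)
    log_p = F.log_softmax(agg, dim=1)
    q = F.softmax(att, dim=1)
    # KL(q || p): pushes prototype mass toward semantically salient regions.
    return F.kl_div(log_p, q, reduction="batchmean")


def compactness_loss(proto_maps: torch.Tensor) -> torch.Tensor:
    """Penalize spatial overlap between different prototypes so each one
    attends to a distinct, localized region."""
    B, K, H, W = proto_maps.shape
    maps = F.softmax(proto_maps.flatten(2), dim=2)  # (B, K, H*W) per-prototype distributions
    # Gram matrix of spatial distributions; off-diagonal entries measure
    # how much two different prototypes fire on the same locations.
    gram = torch.bmm(maps, maps.transpose(1, 2))    # (B, K, K)
    off_diag = gram - torch.diag_embed(torch.diagonal(gram, dim1=1, dim2=2))
    return off_diag.sum(dim=(1, 2)).mean() / (K * (K - 1))
```

In a full training loop these terms would presumably be weighted and added to the standard cross-entropy objective, e.g. `loss = ce + lam1 * alignment_loss(P, A) + lam2 * compactness_loss(P)`, with the weights `lam1` and `lam2` likewise assumed here.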