Making the Classification Explanation Faithful to the Confidence Score
Abstract
Deep Neural Networks (DNNs) have revolutionized numerous industries, yet their decision-making processes remain largely opaque. Most existing explanation methods visualize the importance of the image regions that influence a classifier's decision, but they focus predominantly on identifying regions with positive contributions and often overlook those with negative ones. In this paper, we introduce a novel black-box explanation method, the Metropolis-Hastings Explainer (MHE), designed to provide confidence-faithful explanations. MHE improves the fidelity of explanations by sampling instances whose confidence scores best match the classifier's original confidence, ensuring that the explained regions closely align with the original confidence score. Furthermore, MHE improves sampling efficiency by reusing existing valid samples to explore further potentially valid ones, reducing computational overhead. To enhance the clarity of explanations, MHE prioritizes valid samples with smaller areas when other factors are equal, thereby shrinking the explained region. Building on the MHE framework, we propose two extensions: MHE-e, which focuses exclusively on regions with positive contributions, and MHE-pro, which refines explanation quality by integrating multi-scale information, progressively refining the explained regions to optimize both sampling efficiency and explanation quality. Experimental results demonstrate that MHE delivers superior and stable explanation quality across various models, including ResNet50, VGG16, ViT DINO, and CLIP, on datasets such as ImageNet, CUB-200-2011, and VOC2012, providing explanations that closely approximate the original classification confidence.
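As a rough illustration of the sampling mechanism the abstract describes, the sketch below implements a generic Metropolis-Hastings loop over binary region masks: proposals whose masked-image confidence better matches the original confidence, or that cover less area at an equal match, are more likely to be accepted. The `classifier` interface, the grid-mask proposal, the area weight `lam`, and the `temperature` are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def mh_explain(classifier, image, target_class, n_steps=1000,
               grid=8, lam=0.1, temperature=0.05, rng=None):
    """Sample binary region masks whose masked-image confidence matches
    the classifier's confidence on the full image, preferring small masks.
    Assumes `classifier(image)` returns a probability vector and that the
    image height/width are multiples of `grid`."""
    rng = np.random.default_rng() if rng is None else rng
    p_orig = classifier(image)[target_class]          # original confidence

    def score(mask):
        # Higher when the masked image's confidence is close to p_orig
        # and the retained area is small (tie-breaking toward smaller masks).
        p_mask = classifier(apply_mask(image, mask))[target_class]
        return -abs(p_mask - p_orig) - lam * mask.mean()

    mask = rng.integers(0, 2, size=(grid, grid))      # random initial mask
    s = score(mask)
    best_mask, best_s = mask.copy(), s
    for _ in range(n_steps):
        # Symmetric proposal: flip one grid cell, so the q-terms cancel.
        cand = mask.copy()
        i, j = rng.integers(grid), rng.integers(grid)
        cand[i, j] ^= 1
        s_cand = score(cand)
        # Metropolis acceptance test on the temperature-scaled score.
        if np.log(rng.random()) < (s_cand - s) / temperature:
            mask, s = cand, s_cand
            if s > best_s:
                best_mask, best_s = mask.copy(), s
    return best_mask

def apply_mask(image, mask):
    # Upsample the grid mask to image resolution and zero out masked-off
    # pixels; assumes image dimensions are divisible by the grid size.
    h, w = image.shape[:2]
    up = mask.repeat(h // mask.shape[0], axis=0).repeat(w // mask.shape[1], axis=1)
    return image * up[:, :, None]
```

Because the single-cell-flip proposal is symmetric, the proposal densities cancel in the acceptance ratio, leaving the simple Metropolis test shown; the actual MHE proposal and scoring may differ.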