Principled Interpretability in Vision Models: From Mechanistic Understanding to Interpretable Models by Design
Abstract
As deep learning systems are increasingly deployed in high-stakes applications, understanding their behavior is critical for ensuring trust and safety. Interpretability provides essential tools to explain, debug, and improve these models. However, the field remains fragmented, spanning a wide range of methods and assumptions, while lacking standardized evaluation protocols. This tutorial aims to provide a unified overview of interpretability in deep learning, bridging post-hoc mechanistic understanding and methods to design inherently interpretable deep learning models. By the end of this tutorial, attendees will gain a solid understanding of modern interpretability methods for deep learning models, how to rigorously evaluate them, and open research directions in this critical area.