

Understanding Video Transformers via Universal Concept Discovery

Matthew Kowal · Achal Dave · Rares Andrei Ambrus · Adrien Gaidon · Kosta Derpanis · Pavel Tokmakov

Arch 4A-E Poster #126
Award: Highlight
Thu 20 Jun 10:30 a.m. PDT — noon PDT


This paper studies the problem of concept-based interpretability of transformer representations for videos. Concretely, we seek to explain the decision-making process of video transformers based on high-level, spatiotemporal concepts that are automatically discovered. Prior research on concept-based interpretability has concentrated solely on image-level tasks, like image classification. Comparatively, video models deal with the added temporal dimension, increasing complexity and posing challenges in identifying dynamic concepts over time. In this work, we systematically address these challenges by introducing the first Video Transformer Concept Discovery (VTCD) algorithm. To this end, we propose an efficient approach for unsupervised identification of units of video transformer representations (concepts). We then design a noise-robust algorithm for ranking the importance of these units to the output of a model, allowing us to analyze its decision-making process. Performing this analysis jointly over a diverse set of supervised and self-supervised models, we make a number of important discoveries about universal units of video representations. Finally, we demonstrate that VTCD can be used to improve model performance for fine-grained tasks.
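
As a rough illustration of the two stages the abstract describes (unsupervised discovery of concept units, then ranking them by importance to the model output), the toy sketch below clusters feature vectors with plain k-means and scores each cluster by random masking. This is not the VTCD algorithm: the abstract does not specify the clustering or ranking procedure, so the choice of k-means, the masking scheme, and every name here (`kmeans`, `rank_concepts`, `toy_model`) are assumptions made purely for illustration.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means over feature vectors; returns (centroids, assignments).

    `points` stands in for token features taken from one layer of a video
    model (a hypothetical input; any list of equal-length tuples works).
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid (squared Euclidean).
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign


def rank_concepts(model, k, trials=2000, seed=0):
    """Rank k concepts by how much the model output depends on them.

    Each trial keeps a random subset of concepts and records the output.
    A concept's score is the mean output when it is kept minus the mean
    output when it is masked; averaging over many random subsets damps
    the noise contributed by the other concepts.
    """
    rng = random.Random(seed)
    kept_sum, kept_n = [0.0] * k, [0] * k
    masked_sum, masked_n = [0.0] * k, [0] * k
    for _ in range(trials):
        keep = [rng.random() < 0.5 for _ in range(k)]
        out = model(keep)
        for c in range(k):
            if keep[c]:
                kept_sum[c] += out
                kept_n[c] += 1
            else:
                masked_sum[c] += out
                masked_n[c] += 1
    scores = [
        kept_sum[c] / max(kept_n[c], 1) - masked_sum[c] / max(masked_n[c], 1)
        for c in range(k)
    ]
    ranking = sorted(range(k), key=lambda c: scores[c], reverse=True)
    return ranking, scores


def toy_model(keep, weights=(0.0, 2.0, 1.0)):
    """Hypothetical stand-in for a model: output is the summed weight of kept concepts."""
    return sum(w for w, kept in zip(weights, keep) if kept)


# Two well-separated groups of 2-D "features" and a three-concept toy model.
features = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, assignments = kmeans(features, k=2)
ranking, scores = rank_concepts(toy_model, k=3)
```

In this toy setup, `assignments` groups the features into candidate concepts and `ranking` orders the toy model's concepts from most to least influential. A real analysis would replace `toy_model` with a forward pass of the video model in which the features belonging to masked concepts are removed or perturbed.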
