Network Dissection: Quantifying Interpretability of Deep Visual Representations
David Bau*, Bolei Zhou*, Aditya Khosla, Aude Oliva, Antonio Torralba
Number of Passes: ✅✅✅
Why this paper?
While reading Prof. Christopher Manning's article, one section that caught my eye in particular was the discussion of meaning as determined by the theory-of-reference and theory-of-meaning approaches. I then wondered whether there might be a fifth era of NLP in which, rather than remaining two orthogonal schools of thought, techniques from both could be incorporated into a single framework, with prompting potentially serving as early evidence of the promise of such structured semantic guidance.
As an undergrad, I assisted with research in the programming languages space; during the first semester of my Master's, I worked on parsing for formal grammars and took an automated reasoning course before deciding NLP would be a better fit. The consequence is that I'm interested in the intersection of large language models and program synthesis, which in recent years has manifested as neurosymbolic programming research. In its opening section on the motivations and merits of this approach, the Neurosymbolic Programming textbook highlights comprehensible and actionable interpretability of models as an upside of such work, and that is the specific aspect of this domain I find most compelling.
Part of this exploration so far has been understanding the current state of tools and frameworks for quantifying and qualifying the interpretability of machine learning models. The Network Dissection paper seems to be quite seminal in this area, and I thought it would be a good opening work for learning more about this space.
Context
In prior years, there had been much work on gathering visual evidence of low- and high-level features across the intermediate layers of neural networks. Interpretability of vision models attracted much excitement because, at the time of this paper, detectors for human concepts were evidently showing up in <model, dataset> pairs spanning a variety of vision tasks.
While such a phenomenon is exciting, a worthwhile follow-up inquiry is what to make of it. From this high-level starting point, the authors list three specific questions that drive the action items in their paper:
1) What is a disentangled representation, and how can its factors be quantified and detected?
2) Do interpretable hidden units reflect a special alignment of feature space, or are interpretations a chimera?
3) What conditions in state-of-the-art training lead to representations with greater or lesser entanglement?
I include a simplified, personal reinterpretation of these questions in slide 4 of the embedded Google Slides presentation below.
Problem Statement
Broadly, the authors define interpretability as the number of human-recognized concepts from a dataset that are detected by the convolutional filters (units) of a neural network.
This is addressed via two action items:
• How do we identify human-recognized concepts and evaluate interpretability with concepts as the ground truth? -> Broden dataset
• How do we quantify whether a convolutional filter (unit) is a detector for a concept? -> Network Dissection (sketched below)
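To make the second item concrete, here is a minimal sketch of the Network Dissection scoring step as I understand it from the paper: each unit's activation maps are thresholded at a per-unit top-quantile level T_k (chosen so that P(a_k > T_k) = 0.005 over the dataset), upsampled to the annotation resolution, and compared against each concept's ground-truth segmentation masks via intersection-over-union (IoU); a unit counts as a detector for a concept when its dataset-wide IoU exceeds 0.04. The array names and shapes below are hypothetical, chosen for illustration rather than taken from the authors' released code.

```python
import numpy as np
from scipy.ndimage import zoom  # simple interpolation-based upsampling

# Hypothetical inputs (names/shapes are illustrative, not from the released code):
#   activations:   (num_images, num_units, h, w)    activation maps for one layer
#   concept_masks: (num_concepts, num_images, H, W) boolean annotation masks (Broden-style)
ACTIVATION_QUANTILE = 0.005  # top 0.5% of a unit's activations defines its threshold T_k
IOU_THRESHOLD = 0.04         # a unit "detects" a concept if its dataset-wide IoU exceeds this

def dissect_layer(activations, concept_masks):
    num_images, num_units, h, w = activations.shape
    num_concepts, _, H, W = concept_masks.shape

    detections = {}  # unit index -> (best concept index, IoU)
    for k in range(num_units):
        unit_acts = activations[:, k]                            # (num_images, h, w)
        t_k = np.quantile(unit_acts, 1.0 - ACTIVATION_QUANTILE)

        # Upsample each low-resolution activation map to the annotation resolution,
        # then binarize it with the per-unit threshold.
        upsampled = zoom(unit_acts, (1, H / h, W / w), order=1)  # (num_images, H, W)
        unit_mask = upsampled > t_k

        best_concept, best_iou = None, 0.0
        for c in range(num_concepts):
            labels = concept_masks[c]                            # (num_images, H, W)
            intersection = np.logical_and(unit_mask, labels).sum()
            union = np.logical_or(unit_mask, labels).sum()
            iou = intersection / union if union > 0 else 0.0
            if iou > best_iou:
                best_concept, best_iou = c, iou

        if best_iou > IOU_THRESHOLD:
            detections[k] = (best_concept, best_iou)

    # The layer's interpretability score is the number of *unique* concepts detected.
    unique_concepts = {c for c, _ in detections.values()}
    return detections, len(unique_concepts)
```

As I recall, the released implementation handles further details that this sketch glosses over, such as restricting the IoU sums to the subset of images actually annotated for a given concept's category.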
Notes, Questions, & Looking Forward
Presentation slides I created for this paper when taking Princeton University's COS 597B: Recent Advances in Computer Vision class (Fall 2022), taught by Professor Olga Russakovsky.
These slides contain a summary of the paper, discussion questions sprinkled throughout, and a concluding slide of questions raised as a result of this paper.
For myself, I think the main action item is to read more follow-up work to understand the subsequent, competing justifications and implementations for evaluating interpretability.