A Multimodal Automated Interpretability Agent

* indicates equal contribution.
ICML 2024

How can AI systems help us understand other AI systems?

Interpretability Agents automate aspects of scientific experimentation to answer user queries about trained models. See the experiment browser for more experiments.

Understanding of a neural model can take many forms. For instance, we might want to know when and how the system relies on sensitive or spurious features, identify systematic errors in its predictions, or learn how to modify the training data and model architecture to improve accuracy and robustness. Today, answering these types of questions often involves significant human effort—researchers must formalize their question, formulate hypotheses about a model’s decision-making process, design datasets on which to evaluate model behavior, then use these datasets to refine and validate hypotheses. As a result, this type of understanding is slow and expensive to obtain, even about the most widely used models.

Automated Interpretability approaches have begun to address the issue of scale. Recently, such approaches have used pretrained language models like GPT-4 (in Bills et al. 2023) or Claude (in Bricken et al. 2023) to generate feature explanations. In earlier work, we introduced MILAN (Hernandez et al. 2022), a captioner model trained on human feature annotations that takes as input a feature visualization and outputs a description of that feature. But automated approaches that use learned models to label features leave something to be desired: they are primarily tools for one-shot hypothesis generation (Huang et al. 2023) rather than causal explanation, they characterize behavior on a limited set of inputs, and they are often low precision.

Our current line of research aims to build tools that help users understand models, while combining the flexibility of human experimentation with the scalability of automated techniques. We introduce the Multimodal Automated Interpretability Agent (MAIA), which designs experiments to answer user queries about components of AI systems. MAIA iteratively generates hypotheses, runs experiments that test these hypotheses, observes experimental outcomes, and updates hypotheses until it can answer the user query. MAIA builds on the Automated Interpretability Agent (AIA) paradigm we introduced in Schwettmann et al. 2023, where an LM-based agent interactively probes systems to explain their behavior. MAIA is equipped with a vision-language model backbone and an API of tools for designing interpretability experiments. With simple modifications to the user query to the agent, the same modular system can field both "macroscopic" questions like identifying systematic biases in model predictions (see the tench example above), as well as "microscopic" questions like describing individual features (see example below).


MAIA Schematic

MAIA is a system that uses neural models to automate neural model understanding tasks like feature interpretation and failure mode discovery. It equips a pre-trained vision-language model with a set of tools that support iterative experimentation on subcomponents of other models to explain their behavior. These include tools commonly used by human interpretability researchers: for synthesizing and editing inputs, computing maximally activating exemplars from real-world datasets, and summarizing and describing experimental results. Interpretability experiments proposed by MAIA compose these tools to describe and explain system behavior.

MAIA uses tools to design experiments on other systems

MAIA composes interpretability subroutines into Python programs to answer user queries about a system. What kinds of experiments does MAIA design? Below we highlight example usage of individual tools to run experiments on neurons inside common vision architectures (CLIP, ResNet, DINO). These are experimental excerpts intended to demonstrate tool use; often, MAIA runs many more experiments to reach its final conclusion. For full experiment logs, check out our interactive experiment browser.
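The basic probe-observe-update loop can be sketched as follows. This is a minimal illustration with made-up names, not the actual MAIA API: the agent probes the system under study, logs the outcome of each experiment, and keeps the best-supported hypothesis.

```python
# Minimal sketch of an interpretability-agent loop (illustrative names only):
# the agent probes a system, logs each experiment, and keeps the hypothesis
# whose probe drives the system hardest.

from dataclasses import dataclass, field

class ToySystem:
    """Stand-in for the subcomponent under study (e.g. a single neuron)."""
    def activation(self, prompt: str) -> float:
        # toy "neuron" that fires on tennis-related inputs
        return 0.9 if "tennis" in prompt else 0.1

@dataclass
class Agent:
    system: ToySystem
    log: list = field(default_factory=list)

    def run_experiment(self, prompt: str) -> float:
        act = self.system.activation(prompt)
        self.log.append((prompt, act))
        return act

    def interpret(self, hypotheses: list) -> str:
        # keep the hypothesis whose probe activates the system most
        return max(hypotheses, key=self.run_experiment)

agent = Agent(ToySystem())
label = agent.interpret(["a dog on a beach", "a tennis ball", "a red car"])
```

In the real system the experiment log is fed back to the vision-language model, which decides what to try next rather than exhausting a fixed hypothesis list.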

Visualizing Dataset Exemplars

MAIA uses the dataset_exemplars tool to compute images from the ImageNet dataset that maximally activate a given system (in this case, an individual neuron). The dataset_exemplars tool returns masked versions of the images highlighting image subregions that maximally activate the neuron, as well as the activation value.

Dataset Exemplars
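A hedged sketch of what a dataset-exemplar routine computes (function and parameter names here are illustrative, not the released API): rank dataset images by a unit's maximum spatial activation, then mask each top image to its most active subregion.

```python
# Sketch of a dataset_exemplars-style routine (illustrative names): rank
# images by a unit's max spatial activation and mask the top-activating region.

import numpy as np

def dataset_exemplars(images, unit_activation, k=3, quantile=0.95):
    """Return the k images (with masks and scores) that maximally activate a unit.

    images: list of (H, W, 3) float arrays
    unit_activation: maps an image to an (h, w) spatial activation map
    """
    scored = []
    for img in images:
        amap = unit_activation(img)
        scored.append((float(amap.max()), img, amap))
    scored.sort(key=lambda t: t[0], reverse=True)

    exemplars = []
    for score, img, amap in scored[:k]:
        thresh = np.quantile(amap, quantile)
        mask = amap >= thresh          # highlight the top-activating subregion
        exemplars.append((score, img, mask))
    return exemplars

rng = np.random.default_rng(0)
imgs = [rng.random((8, 8, 3)) for _ in range(10)]
unit = lambda img: img.mean(axis=-1)   # toy "neuron": per-pixel brightness
top = dataset_exemplars(imgs, unit)
```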

Generating Synthetic Test Images

In addition to using real-world stimuli as inputs to the system it is trying to interpret, MAIA can generate additional synthetic inputs that test specific dimensions of a system's selectivity. MAIA uses the text2image function to call a pretrained text-guided diffusion model on prompts it writes. These prompts can test specific hypotheses about the neuron's selectivities, such as in the example of the tennis ball neuron below.

Synthetic Exemplars
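The protocol can be sketched as below. Here `text2image` is a deterministic stub standing in for the pretrained diffusion model; the point is the structure of the experiment: write prompts that isolate one dimension of a hypothesized selectivity, generate an image per prompt, and record the unit's response to each.

```python
# Sketch of hypothesis testing with synthetic inputs. text2image is a stub
# for a text-guided diffusion model; the agent compares the unit's responses
# across controlled prompt variants.

import zlib
import numpy as np

def text2image(prompt: str) -> np.ndarray:
    """Stub generator: a deterministic pseudo-image per prompt."""
    rng = np.random.default_rng(zlib.crc32(prompt.encode()))
    return rng.random((64, 64, 3))

def probe(prompts, activation_fn):
    """Generate one image per prompt and record the unit's response."""
    return {p: float(activation_fn(text2image(p))) for p in prompts}

# Prompt variants that test a "tennis ball" hypothesis from different angles:
prompts = [
    "a tennis ball on a court",
    "a tennis racket with no ball",
    "a plain yellow sphere on grass",
]
results = probe(prompts, activation_fn=lambda img: float(img.mean()))
```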

Image Editing

MAIA can also call the edit_images tool, which uses a text-based image editing module (InstructPix2Pix) to edit images according to prompts written by MAIA. MAIA uses this tool to causally intervene on the input space in order to test specific hypotheses about system behavior (e.g. whether the presence of a certain feature is required for the observed behavior).

Image Editing Examples
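The intervention logic can be sketched as follows, with `edit_image` as a toy stub in place of the editing module: the quantity of interest is the activation difference before vs. after an edit that removes a hypothesized feature.

```python
# Sketch of a causal intervention in input space. edit_image is a stub for a
# text-based editing module; a large activation drop after removing a feature
# is evidence the feature was necessary for the observed behavior.

import numpy as np

def edit_image(image: np.ndarray, instruction: str) -> np.ndarray:
    """Stub editor: 'remove' edits simulate deleting a bright feature."""
    return image * 0.2 if "remove" in instruction else image

def causal_effect(image, instruction, activation_fn):
    before = float(activation_fn(image))
    after = float(activation_fn(edit_image(image, instruction)))
    return before - after   # large drop => feature was necessary

img = np.full((8, 8, 3), 0.8)
effect = causal_effect(img, "remove the tennis ball", lambda im: im.mean())
```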

Using MAIA to remove spurious features

Learned spurious features pose a challenge when machine learning models are applied in real-world scenarios, where test distributions can differ from training set statistics. We use MAIA to identify and remove learned spurious features inside a classification network (ResNet-18 trained on Spawrious, a synthetically generated dataset involving four dog breeds with different backgrounds). In the train set, each dog breed is spuriously correlated with a certain background (e.g. snow, jungle, desert, beach), while in the test set the breed-background pairings are scrambled. We use MAIA to find a subset of final-layer neurons that robustly predict a single dog breed independently of spurious features, simply by changing the query in the user prompt (see paper for more experimental details). Below, see an example neuron that MAIA determines to be selective for spurious correlations between dog breed and background:

As well as another example neuron that MAIA determines to be selective for a single dog breed, independently of its background:

We then use the features MAIA selects to train an unregularized logistic regression model on the unbalanced data. We find that with no access to unbiased examples, MAIA can identify and remove spurious features, improving model robustness under distribution shift by a wide margin, with an accuracy approaching that of fine-tuning on balanced data (see Section 5.1 of the paper for more results).
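The selection-then-retraining step can be sketched with toy data (the unit indices and features below are made up; the paper uses the real Spawrious classes and a multi-class head): keep only the units judged breed-selective, then fit an unregularized logistic-regression head on those features alone.

```python
# Sketch of feature selection + unregularized logistic regression, binary
# case with synthetic data. The spurious unit flips at test time (scrambled
# breed-background pairing), so dropping it preserves test accuracy.

import numpy as np

def fit_logreg(X, y, lr=0.5, steps=500):
    """Plain gradient-descent logistic regression, no regularization."""
    Xb = np.column_stack([X, np.ones(len(X))])   # add intercept
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict(X, w):
    Xb = np.column_stack([X, np.ones(len(X))])
    return (Xb @ w) > 0

rng = np.random.default_rng(0)
n = 200
breed = rng.integers(0, 2, n)                      # true label
robust = breed + 0.1 * rng.standard_normal(n)      # breed-selective unit
spurious = breed + 0.1 * rng.standard_normal(n)    # tracks the background
feats = np.column_stack([robust, spurious])

selected = [0]            # keep the robust unit, drop the spurious one
w = fit_logreg(feats[:, selected], breed)

# At test time the pairing is scrambled, so the spurious unit flips while
# the selected robust unit still predicts the breed:
spurious_test = (1 - breed) + 0.1 * rng.standard_normal(n)
feats_test = np.column_stack([robust, spurious_test])
preds = predict(feats_test[:, selected], w)
```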

Using MAIA to reveal biases

MAIA can also be applied to surface model-level biases. In a preliminary demonstration, we investigate biases in the outputs of an image classifier (ResNet-152) trained on a supervised ImageNet classification task. MAIA is easily adapted to this task by instrumenting an output logit corresponding to a particular image class as the system to be interpreted, and instructing MAIA to determine whether the actual distribution of images that receive high class scores matches the class label (see paper Appendix for full MAIA instructions).

Bias Categories
MAIA bias detection. MAIA generates synthetic inputs to surface biases in ResNet-152 output classes. In some cases, MAIA discovers uniform behavior over the inputs (e.g. flagpole).
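Instrumenting an output logit as the system under study can be sketched as below. The two-class model and the inputs here are toy stand-ins: the agent treats logit(class) as a scalar-valued unit and probes it with input variants to see whether high scores track the class label or something narrower.

```python
# Sketch of wrapping one class logit as the "system" an agent interprets.
# toy_model pretends class 0 fires only on very red images (a narrow bias).

import numpy as np

class LogitSystem:
    """Expose a single class logit of a classifier as the unit to interpret."""
    def __init__(self, model, class_idx: int):
        self.model, self.class_idx = model, class_idx
    def __call__(self, image) -> float:
        return float(self.model(image)[self.class_idx])

def toy_model(image):
    redness = image[..., 0].mean()
    return np.array([redness, 1.0 - redness])

system = LogitSystem(toy_model, class_idx=0)
red = np.zeros((4, 4, 3)); red[..., 0] = 1.0
blue = np.zeros((4, 4, 3)); blue[..., 2] = 1.0
scores = {"red variant": system(red), "blue variant": system(blue)}
```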

Validating MAIA explanations

MAIA is useful for applications like spurious feature removal and bias detection, but are its descriptions accurate?

We evaluate MAIA on the neuron description paradigm, which appears as a subroutine of many interpretability procedures (e.g. Bau et al. 2017, Hernandez et al. 2022, Oikarinen & Weng 2022, Bills et al. 2023) and offers common baselines for comparison. Section 4 of the paper shows that MAIA labels outperform baseline methods at predicting the behavior of neurons across a variety of vision architectures in the wild, and in many cases are comparable to labels produced by human experts using the MAIA API to run experiments on neurons.

Yet evaluating the absolute accuracy of MAIA's explanations of features in the wild presents a familiar challenge for the field of interpretability: measuring accuracy is difficult when ground truth explanations are unknown. Following the procedure we introduced in FIND (Schwettmann et al. 2023) for validating the performance of interpretability methods on synthetic test systems mimicking real-world behaviors, we construct a set of synthetic vision neurons with known ground-truth selectivity.

Synthetic Neurons

We simulate concept detection performed by neurons inside vision models using semantic segmentation. Synthetic neurons are built using an open-set concept detector that combines Grounded DINO (Liu et al., 2023) with SAM (Kirillov et al., 2023) to perform text-guided image segmentation. The ground-truth behavior of each neuron is determined by a text description of the concept(s) the neuron is selective for. MAIA descriptions can thus be compared to ground truth text description of each neuron to evaluate their accuracy. We find that MAIA descriptions match ground truth labels as well as descriptions produced by human experts (see Section 4 of the paper for more details).
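The construction can be sketched as follows, with `segment` as a stub in place of the Grounded DINO + SAM pipeline: each synthetic neuron stores its ground-truth concept as text, its activation is the detector's confidence for that concept, and the returned mask localizes the "firing" region.

```python
# Sketch of a synthetic vision neuron with known ground truth. segment()
# stands in for a text-guided open-set segmenter; the neuron's activation
# is the detector confidence for its ground-truth concept string.

import numpy as np

def segment(image, concept):
    """Stub segmenter: detects the concept if it appears in the image tags."""
    present = concept in image["tags"]
    mask = np.ones(image["pixels"].shape[:2], bool) if present else None
    return mask, (0.95 if present else 0.02)

class SyntheticNeuron:
    def __init__(self, concept: str):
        self.concept = concept         # ground-truth selectivity, in text
    def __call__(self, image):
        mask, confidence = segment(image, self.concept)
        return confidence, mask

neuron = SyntheticNeuron("tennis ball")
img_pos = {"pixels": np.zeros((8, 8, 3)), "tags": {"tennis ball", "court"}}
img_neg = {"pixels": np.zeros((8, 8, 3)), "tags": {"dog", "beach"}}
act_pos, _ = neuron(img_pos)
act_neg, _ = neuron(img_neg)
```

Because the concept string is the ground truth, an agent's free-text description of the neuron can be scored directly against it.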

Synthetic systems for evaluating automated interpretability methods

The synthetic neurons used to evaluate MAIA are part of a broader effort to construct a testbed of systems with known ground-truth structure that mimic subcomponents of trained networks. Automated Interpretability Agents interactively probe these systems to generate descriptions of their behavior, which are then automatically evaluated. For more details on our FIND (Function Interpretation and Description) benchmark and to download additional test systems in non-visual domains, see our previous work:


FIND: A Function Description Benchmark for Evaluating Interpretability Methods

Sarah Schwettmann*, Tamar Rott Shaham*, Joanna Materzynska, Neil Chowdhury, Shuang Li, Jacob Andreas, David Bau, Antonio Torralba. NeurIPS 2023.



@inproceedings{shaham2024multimodal,
      title={A Multimodal Automated Interpretability Agent},
      author={Tamar Rott Shaham and Sarah Schwettmann and Franklin Wang and Achyuta Rajaram and Evan Hernandez and Jacob Andreas and Antonio Torralba},
      booktitle={International Conference on Machine Learning},
      year={2024}
}