We are a team of researchers based at MIT CSAIL, with close collaborators at other institutions in both academia and industry. We build tools for reverse-engineering AI systems to make them more transparent and steerable. We are particularly focused on Automated Interpretability: approaches that use neural models themselves to assist with model understanding tasks.

Recent Publications

MAIA

A Multimodal Automated Interpretability Agent

Tamar Rott Shaham*, Sarah Schwettmann*, Franklin Wang, Achyuta Rajaram, Evan Hernandez, Jacob Andreas, Antonio Torralba. ICML 2024.

An agent that autonomously conducts experiments on other systems to explain their behavior by composing interpretability subroutines into Python programs.

Paper · Project Page · Experiment Browser

FIND

FIND: A Function Description Benchmark for Evaluating Interpretability Methods

Sarah Schwettmann*, Tamar Rott Shaham*, Joanna Materzynska, Neil Chowdhury, Shuang Li, Jacob Andreas, David Bau, Antonio Torralba. NeurIPS 2023.

An interactive dataset of functions resembling subcomputations inside trained neural networks, used to validate and compare open-ended labeling tools, together with a new method that uses Automated Interpretability Agents to explain other systems.

Paper · Dataset · Project Page · News

Visual Circuits

Automatic Discovery of Visual Circuits

Achyuta Rajaram*, Neil Chowdhury*, Antonio Torralba, Jacob Andreas, Sarah Schwettmann. NeurIPS ATTRIB 2023.

We introduce a new technique for automatically discovering subgraphs of vision models that detect visual concepts.

Paper · News

Multimodal neurons

Multimodal Neurons in Pretrained Text-Only Transformers

Sarah Schwettmann*, Neil Chowdhury*, Samuel Klein, David Bau, Antonio Torralba. ICCV CVCL 2023 (Oral).

We find multimodal neurons in a transformer pretrained only on language. When image representations are aligned to the language model, these neurons activate on specific image features and inject related text into the model's next-token prediction.

Paper · Project Page

MILAN

Natural Language Descriptions of Deep Visual Features

Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, Jacob Andreas. ICLR 2022 (Oral).

We introduce a procedure that automatically labels neurons in deep networks with open-ended, compositional, natural-language descriptions of their function.

Paper · Project Page · News

Contact

multimodalinterpretability@mit.edu