We are a team of researchers at MIT CSAIL, working with close collaborators at other institutions in both academia and industry. We build tools for reverse-engineering AI systems to make them more transparent and steerable. We are particularly focused on Automated Interpretability approaches that use neural models themselves to assist with model understanding tasks.
Tamar Rott Shaham*, Sarah Schwettmann*, Franklin Wang, Achyuta Rajaram, Evan Hernandez, Jacob Andreas, Antonio Torralba. ICML 2024.
An agent that autonomously conducts experiments on other systems to explain their behavior, by composing interpretability subroutines into Python programs.
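As a rough illustration of what such a program might look like (this is not the tool interface from the paper; every name below is a hypothetical placeholder), an agent-written experiment could select exemplars for a unit, apply a controlled edit, and compare activations:

```python
# Illustrative sketch only: a toy experiment in the spirit of composing
# interpretability subroutines into a Python program. All functions and
# names here are hypothetical placeholders, not the paper's actual API.

from typing import Callable, List

def dataset_exemplars(unit: Callable[[str], float], prompts: List[str], k: int = 3) -> List[str]:
    """Return the k prompts that most strongly activate the unit under study."""
    return sorted(prompts, key=unit, reverse=True)[:k]

def run_experiment(unit: Callable[[str], float], prompts: List[str]) -> dict:
    """Probe the unit on its top exemplars and on perturbed versions of them."""
    exemplars = dataset_exemplars(unit, prompts)
    perturbed = [p.replace("dog", "cat") for p in exemplars]  # a controlled edit to test a hypothesis
    return {
        "exemplar_activations": [unit(p) for p in exemplars],
        "perturbed_activations": [unit(p) for p in perturbed],
    }

if __name__ == "__main__":
    # A toy "unit" that fires on dog-related text, standing in for a real neuron.
    toy_unit = lambda s: float(s.lower().count("dog"))
    print(run_experiment(toy_unit, ["a dog in a park", "a cat on a sofa", "two dogs playing fetch"]))
```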
Sarah Schwettmann*, Tamar Rott Shaham*, Joanna Materzynska, Neil Chowdhury, Shuang Li, Jacob Andreas, David Bau, Antonio Torralba. NeurIPS 2023.
An interactive dataset of functions resembling subcomputations inside trained neural networks, built for validating and comparing open-ended labeling tools, together with a new method that uses Automated Interpretability Agents to explain other systems.
Sarah Schwettmann*, Neil Chowdhury*, Samuel Klein, David Bau, Antonio Torralba. ICCV CVCL 2023 (Oral).
We find multimodal neurons in a transformer pretrained only on language. When image representations are aligned to the language model, these neurons activate on specific image features and inject related text into the model's next token prediction.
Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, Jacob Andreas. ICLR 2022 (Oral).
We introduce a procedure that automatically labels neurons in deep networks with open-ended, compositional, natural language descriptions of their function.