As artificial intelligence models become increasingly prevalent and are integrated into diverse sectors like health care, finance, education, transportation, and entertainment, understanding how they work under the hood is critical. Interpreting the mechanisms underlying AI models enables us to audit them for safety and biases, with the potential to deepen our understanding of the science behind intelligence itself.
Imagine if we could directly investigate the human brain by manipulating each of its individual neurons to examine their roles in perceiving a particular object. While such an experiment would be prohibitively invasive in the human brain, it is far more feasible in another type of neural network: one that is artificial. However, somewhat like the human brain, artificial models containing millions of neurons are too large and complex to study by hand, making interpretability at scale a very challenging task.
To address this, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) researchers decided to take an automated approach to interpreting artificial vision models that evaluate different properties of images. They developed “MAIA” (Multimodal Automated Interpretability Agent), a system that automates a variety of neural network interpretability tasks using a vision-language model backbone equipped with tools for experimenting on other AI systems.
“Our goal is to create an AI researcher that can conduct interpretability experiments autonomously. Existing automated interpretability methods merely label or visualize data in a one-shot process. However, MAIA can generate hypotheses, design experiments to test them, and refine its understanding through iterative analysis,” says Tamar Rott Shaham, an MIT electrical engineering and computer science (EECS) postdoc at CSAIL and co-author on a new paper about the research. “By combining a pre-trained vision-language model with a library of interpretability tools, our multimodal method can respond to user queries by composing and running targeted experiments on specific models, continuously refining its approach until it can provide a comprehensive answer.”
The automated agent is demonstrated to tackle three key tasks: It labels individual components inside vision models and describes the visual concepts that activate them, it cleans up image classifiers by removing irrelevant features to make them more robust to new situations, and it hunts for hidden biases in AI systems to help uncover potential fairness issues in their outputs. “But a key advantage of a system like MAIA is its flexibility,” says Sarah Schwettmann PhD ’21, a research scientist at CSAIL and co-lead of the research. “We demonstrated MAIA’s usefulness on a few specific tasks, but given that the system is built from a foundation model with broad reasoning capabilities, it can answer many different types of interpretability queries from users and design experiments on the fly to investigate them.”
Neuron by neuron
In one example task, a human user asks MAIA to describe the concepts that a particular neuron inside a vision model is responsible for detecting. To investigate this question, MAIA first uses a tool that retrieves “dataset exemplars” from the ImageNet dataset, which maximally activate the neuron. For this example neuron, those images show people in formal attire, and closeups of their chins and necks. MAIA makes various hypotheses for what drives the neuron’s activity: facial expressions, chins, or neckties. MAIA then uses its tools to design experiments to test each hypothesis individually by generating and editing synthetic images: in one experiment, adding a bow tie to an image of a human face increases the neuron’s response. “This approach allows us to determine the specific cause of the neuron’s activity, much like a real scientific experiment,” says Rott Shaham.
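To make that retrieve-hypothesize-test loop concrete, the sketch below is a minimal, hypothetical Python rendering of the cycle described above; the helper functions, file names, and activation numbers are invented stand-ins, not MAIA’s actual tools or results.

```python
"""Illustrative sketch (not the authors' code) of a MAIA-style neuron
interpretation loop: retrieve exemplars, propose hypotheses, then test
each hypothesis with a targeted image edit and measure how the neuron's
activation changes. All helpers are hypothetical placeholders."""

from dataclasses import dataclass


@dataclass
class Experiment:
    hypothesis: str          # e.g., "the neuron detects neckties"
    edit: str                # image edit used to test it, e.g., "add a bow tie"
    activation_delta: float  # change in neuron response after the edit


def retrieve_exemplars(neuron_id: int) -> list[str]:
    # Stand-in for a dataset-exemplar tool: would return the ImageNet
    # images that maximally activate the neuron.
    return ["person_in_suit.jpg", "chin_closeup.jpg", "necktie_portrait.jpg"]


def measure_activation_change(neuron_id: int, edit: str) -> float:
    # Stand-in for editing a synthetic image and re-running the vision
    # model; returns a fixed toy value for the demo.
    return 0.8 if "tie" in edit else 0.1


def interpret_neuron(neuron_id: int) -> str:
    exemplars = retrieve_exemplars(neuron_id)
    print(f"Exemplars for neuron {neuron_id}: {exemplars}")

    # Candidate explanations proposed after inspecting the exemplars.
    hypotheses = {
        "facial expressions": "exaggerate the smile",
        "chins": "enlarge the chin",
        "neckties": "add a bow tie",
    }

    # Run one targeted experiment per hypothesis and keep the one whose
    # edit moves the neuron's activation the most.
    results = [
        Experiment(h, edit, measure_activation_change(neuron_id, edit))
        for h, edit in hypotheses.items()
    ]
    best = max(results, key=lambda r: r.activation_delta)
    return f"Neuron {neuron_id} most likely detects: {best.hypothesis}"


if __name__ == "__main__":
    print(interpret_neuron(42))
```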
MAIA’s explanations of neuron behaviors are evaluated in two key ways. First, synthetic systems with known ground-truth behaviors are used to assess the accuracy of MAIA’s interpretations. Second, for “real” neurons inside trained AI systems with no ground-truth descriptions, the authors design a new automated evaluation protocol that measures how well MAIA’s descriptions predict neuron behavior on unseen data.
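As a rough illustration of that second, predictive evaluation idea (not the paper’s exact protocol), the toy sketch below scores a candidate description by how well it anticipates whether a neuron fires on held-out inputs; the prediction and activation functions are invented stand-ins.

```python
"""Toy sketch of a predictive evaluation: a good neuron description
should let you predict, for unseen inputs, when the neuron activates
strongly. Both functions below are hypothetical placeholders."""

import random


def predicted_activation(description: str, image_label: str) -> float:
    # Stand-in predictor: assume the neuron fires when the input matches
    # the description (real protocols would generate or select images).
    return 1.0 if image_label in description else 0.0


def true_activation(image_label: str) -> float:
    # Stand-in for running the unseen image through the actual model.
    return 1.0 if image_label == "necktie" else random.uniform(0.0, 0.2)


def score_description(description: str, unseen_labels: list[str]) -> float:
    # Compare predicted vs. actual behavior; higher means the description
    # is a better predictor of the neuron's responses.
    errors = [
        abs(predicted_activation(description, lbl) - true_activation(lbl))
        for lbl in unseen_labels
    ]
    return 1.0 - sum(errors) / len(errors)


if __name__ == "__main__":
    held_out = ["necktie", "dog", "car", "chin"]
    print(score_description("necktie", held_out))
```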
The CSAIL-led method outperformed baseline methods at describing individual neurons in a variety of vision models such as ResNet, CLIP, and the vision transformer DINO. MAIA also performed well on the new dataset of synthetic neurons with known ground-truth descriptions. For both the real and synthetic systems, the descriptions were often on par with descriptions written by human experts.
How are descriptions of AI system components, like individual neurons, useful? “Understanding and localizing behaviors inside large AI systems is a key part of auditing these systems for safety before they’re deployed. In some of our experiments, we show how MAIA can be used to find neurons with unwanted behaviors and remove these behaviors from a model,” says Schwettmann. “We’re building toward a more resilient AI ecosystem where tools for understanding and monitoring AI systems keep pace with system scaling, enabling us to investigate and hopefully understand unexpected challenges introduced by new models.”
Peeking inside neural networks
The nascent field of interpretability is maturing into a distinct research area alongside the rise of “black box” machine learning models. How can researchers crack open these models and understand how they work?
Current methods for peeking inside tend to be limited either in scale or in the precision of the explanations they can produce. Furthermore, existing methods tend to fit a particular model and a specific task. This led the researchers to ask: How can we build a generic system to help users answer interpretability questions about AI models while combining the flexibility of human experimentation with the scalability of automated techniques?
One critical area they wanted this system to address was bias. To determine whether image classifiers displayed bias against particular subcategories of images, the team looked at the final layer of the classification stream (in a system designed to sort or label items, much like a machine that identifies whether a photo is of a dog, cat, or bird) and the probability scores of input images (confidence levels that the machine assigns to its guesses). To understand potential biases in image classification, MAIA was asked to find a subset of images in specific classes (for example, “labrador retriever”) that were likely to be incorrectly labeled by the system. In this example, MAIA found that images of black labradors were likely to be misclassified, suggesting a bias in the model toward yellow-furred retrievers.
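The sketch below shows, with made-up records, the kind of subgroup comparison such a bias probe boils down to: within one class, compare how often the classifier’s top prediction is wrong across visually defined subgroups. The fur-color tags and predictions are hypothetical, not data from the paper.

```python
"""Illustrative sketch (hypothetical data, not MAIA's tooling) of a
within-class bias check: split one class's images into subgroups and
compare top-1 error rates across them."""

from collections import defaultdict

# Each record: (true class, subgroup tag, classifier's top-1 prediction).
predictions = [
    ("labrador_retriever", "yellow", "labrador_retriever"),
    ("labrador_retriever", "yellow", "labrador_retriever"),
    ("labrador_retriever", "black", "rottweiler"),
    ("labrador_retriever", "black", "labrador_retriever"),
    ("labrador_retriever", "black", "flat-coated_retriever"),
]


def error_rate_by_subgroup(records, target_class):
    counts = defaultdict(lambda: [0, 0])  # subgroup -> [errors, total]
    for true_cls, subgroup, pred in records:
        if true_cls != target_class:
            continue
        counts[subgroup][0] += int(pred != true_cls)
        counts[subgroup][1] += 1
    return {group: errs / total for group, (errs, total) in counts.items()}


if __name__ == "__main__":
    # A large gap between subgroups (here, black vs. yellow labs) flags a
    # candidate bias for closer inspection.
    print(error_rate_by_subgroup(predictions, "labrador_retriever"))
```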
Since MAIA relies on external tools to design experiments, its performance is limited by the quality of those tools. But as the quality of tools like image synthesis models improves, so will MAIA. MAIA also shows confirmation bias at times, where it sometimes incorrectly confirms its initial hypothesis. To mitigate this, the researchers built an image-to-text tool, which uses a different instance of the language model to summarize experimental results. Another failure mode is overfitting to a particular experiment, where the model sometimes draws premature conclusions from minimal evidence.
“I think a natural next step for our lab is to move beyond artificial systems and apply similar experiments to human perception,” says Rott Shaham. “Testing this has traditionally required manually designing and testing stimuli, which is labor-intensive. With our agent, we can scale up this process, designing and testing numerous stimuli simultaneously. This might also allow us to compare human visual perception with artificial systems.”
“Understanding neural networks is difficult for humans because they have hundreds of thousands of neurons, each with complex behavior patterns. MAIA helps to bridge this by developing AI agents that can automatically analyze these neurons and report distilled findings back to humans in a digestible way,” says Jacob Steinhardt, assistant professor at the University of California at Berkeley, who wasn’t involved in the research. “Scaling these methods up could be one of the most important routes to understanding and safely overseeing AI systems.”
Rott Shaham and Schwettmann are joined by five fellow CSAIL affiliates on the paper: undergraduate student Franklin Wang; incoming MIT student Achyuta Rajaram; EECS PhD student Evan Hernandez SM ’22; and EECS professors Jacob Andreas and Antonio Torralba. Their work was supported, in part, by the MIT-IBM Watson AI Lab, Open Philanthropy, Hyundai Motor Co., the Army Research Laboratory, Intel, the National Science Foundation, the Zuckerman STEM Leadership Program, and the Viterbi Fellowship. The researchers’ findings will be presented at the International Conference on Machine Learning this week.