Researchers Break Open AI’s Black Box—and Use What They Find Inside to Control It

The inner workings of huge AI systems remain largely opaque, raising significant safety and trust issues. Researchers have now developed a method to extract and manipulate the internal concepts governing model behavior, providing a new way to understand and steer their activity.

Modern AI models are marvels of engineering, but even their creators remain in the dark about how they represent knowledge internally. Because of this, subtle shifts in prompting can produce surprisingly different outputs. Simply asking a model to show its work before answering often improves accuracy, while certain deliberately malicious prompts can override built-in safety features.

This has motivated significant research aimed at teasing out the patterns of activity in these models’ neural networks that correspond to specific concepts. Investigators hope to use these methods to better understand why models behave in certain ways and potentially modify their behavior on the fly.

Now researchers have unveiled an efficient new way of extracting concepts from models that works across language, reasoning, and vision algorithms. In a paper in Science, the researchers used these concepts to both monitor and effectively steer model behavior.

“Our results illustrate the power of internal representations for advancing AI safety and model capabilities,” the authors write. “We showed how these representations enabled model steering, through which we exposed vulnerabilities and improved model capabilities.”

Key to the team’s approach is a new algorithm called the Recursive Feature Machine (RFM). They trained the algorithm on pairs of prompts, some containing a concept of interest and others not, and then identified the patterns of activity in the model’s neural network that track each concept.

This enables the algorithm to learn “concept vectors”: essentially patterns of activity that nudge the model in the direction of a particular concept. The vectors can be used to alter the model’s internal processing while it’s generating an output, steering it toward or away from specific concepts or behaviors.
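
To make the idea concrete, here is a minimal sketch of activation steering using a simple difference-of-means concept vector and a PyTorch forward hook. This is not the authors’ RFM algorithm, and the model, tokenizer, and choice of layer are assumptions; it only illustrates the general mechanic of reading out activations on labeled prompts and nudging them during generation.

```python
# Minimal sketch of concept-vector steering, assuming a Hugging Face-style causal LM.
# A simplified difference-of-means stand-in, not the paper's Recursive Feature Machine;
# `model`, `tokenizer`, and the chosen `layer` are illustrative assumptions.
import torch


def learn_concept_vector(model, tokenizer, layer, with_concept, without_concept):
    """Mean last-token activation difference between prompts with and without the concept."""
    collected = {True: [], False: []}

    def make_hook(label):
        def hook(_module, _inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            collected[label].append(hidden[:, -1, :].detach())
        return hook

    for label, prompts in ((True, with_concept), (False, without_concept)):
        handle = layer.register_forward_hook(make_hook(label))
        for prompt in prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            with torch.no_grad():
                model(**inputs)
        handle.remove()

    direction = torch.cat(collected[True]).mean(0) - torch.cat(collected[False]).mean(0)
    return direction / direction.norm()


def steer(layer, vector, strength=8.0):
    """Add the concept vector to the layer's output on every forward pass; returns the hook handle."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * vector.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return layer.register_forward_hook(hook)
```

With those two helpers, steering amounts to registering the hook before calling the model’s generate method and removing it afterward.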

To test the approach, the researchers asked GPT-4o to produce 512 concepts across five concept classes and to generate training data for each. They extracted concept vectors from the data and used the vectors to steer the behavior of several large AI models.
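
The exact prompts and data format the authors used aren’t described here; the snippet below is only a plausible sketch of that data-generation step, using the OpenAI Python client to ask GPT-4o for matched prompts that do and do not express a concept. The request wording, labels, and counts are assumptions.

```python
# Hypothetical sketch of the training-data step: asking GPT-4o for matched prompts
# that do and do not express a concept. Wording, labels, and counts are assumptions,
# not the authors' actual pipeline.
from openai import OpenAI

client = OpenAI()


def generate_prompt_pairs(concept: str, n: int = 20) -> str:
    request = (
        f"Write {n} short prompts that clearly express the concept '{concept}', "
        f"then {n} matched prompts that do not. Prefix each line with POS: or NEG:."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": request}],
    )
    # Returns labeled text to be split into the two prompt sets used for vector extraction.
    return response.choices[0].message.content
```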

The approach worked well across a broad range of model types, including large language models, vision-language models, and reasoning models. Surprisingly, they found newer, larger, and better-performing models were actually more steerable than some smaller ones.

Crucially, the team showed they could use the technique to expose and address serious vulnerabilities in the models. In one test, they created a vector for the concept of “anti-refusal,” which allowed them to bypass the built-in safety features meant to stop vision-language models from giving advice on how to take drugs. But they also learned a vector for “anti-deception,” which they successfully used to steer a model away from giving misleading answers.

One of the study’s more interesting findings was that the extracted features were transferable across languages. A concept vector learned with English training data could be used to modify outputs in other languages. The researchers also found they could combine multiple concept vectors to control model behavior in more sophisticated ways.
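
Combining vectors can be as simple as taking a weighted sum before applying a steering hook. The snippet below reuses the hypothetical `steer` helper sketched earlier; the weights, variable names, and prompt are illustrative only, not taken from the paper.

```python
# Hypothetical combination of two concept directions from the article: push toward
# anti-deception while pushing away from anti-refusal (i.e. keeping refusals intact).
# `anti_deception_vec`, `anti_refusal_vec`, `target_layer`, `model`, and `tokenizer`
# are the assumed objects from the earlier sketches.
combined = 1.0 * anti_deception_vec - 0.5 * anti_refusal_vec
combined = combined / combined.norm()

handle = steer(target_layer, combined, strength=8.0)
inputs = tokenizer("Summarize this study for a general audience.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=200)
handle.remove()

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```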

But the new technique’s real power is its efficiency. It took fewer than 500 training samples and less than a minute of processing time on a single Nvidia A100 GPU to identify the activity patterns associated with a concept and steer toward it.

The researchers say this could not only make it possible to systematically map concepts inside large AI models, but it could also lead to more efficient ways of tweaking model behavior after training compared with existing methods.

The approach is still a long way from delivering complete model transparency. But it’s a useful addition to the growing arsenal of model-evaluation tools that may become increasingly essential as AI pushes deeper into our lives.
