By now, ChatGPT, Claude, and other large language models have absorbed so much human knowledge that they are far more than simple answer-generators; they can even express abstract concepts, such as certain tones, personalities, biases, and moods. Nevertheless, it’s not obvious exactly how these models come to represent abstract concepts from the knowledge they contain.
Now a team from MIT and the University of California San Diego has developed a way to test whether a large language model (LLM) contains hidden biases, personalities, moods, or other abstract concepts. Their method can zero in on connections inside a model that encode a concept of interest. What’s more, the method can then manipulate, or “steer,” these connections to strengthen or weaken the concept in any answer the model is prompted to give.
The team showed their method could quickly root out and steer more than 500 general concepts in some of the largest LLMs used today. For instance, the researchers could home in on a model’s representations of personalities such as “social influencer” and “conspiracy theorist,” and stances such as “fear of marriage” and “fan of Boston.” They could then tune these representations to amplify or minimize the concepts in any answers that a model generates.
In the case of the “conspiracy theorist” concept, the team identified a representation of this idea within one of the largest vision-language models available today. When they enhanced the representation and then prompted the model to explain the origins of the famous “Blue Marble” image of Earth taken from Apollo 17, the model generated an answer with the tone and perspective of a conspiracy theorist.
The team acknowledges there are risks to extracting certain concepts, which they also illustrate (and caution against). Overall, however, they see the new approach as a way to illuminate hidden concepts and potential vulnerabilities in LLMs, which could then be turned up or down to improve a model’s safety or enhance its performance.
“What this really says about LLMs is that they have these concepts in them, but they’re not all actively exposed,” says Adityanarayanan “Adit” Radhakrishnan, assistant professor of mathematics at MIT. “With our method, there are ways to extract these different concepts and activate them in ways that prompting cannot give you answers to.”
The team published their findings today in a study appearing in the journal Science. The study’s co-authors include Radhakrishnan, Daniel Beaglehole and Mikhail Belkin of UC San Diego, and Enric Boix-Adserà of the University of Pennsylvania.
A fish in a black box
As use of OpenAI’s ChatGPT, Google’s Gemini, Anthropic’s Claude, and other artificial intelligence assistants has exploded, scientists are racing to understand how models represent certain abstract concepts such as “hallucination” and “deception.” In the context of an LLM, a hallucination is a response that is false or contains misleading information, which the model has “hallucinated,” or constructed erroneously as fact.
To find out whether a concept such as “hallucination” is encoded in an LLM, scientists have often taken an approach of “unsupervised learning” — a type of machine learning in which algorithms broadly trawl through unlabeled representations to find patterns that might relate to a concept such as “hallucination.” But to Radhakrishnan, such an approach can be too broad and computationally expensive.
“It’s like going fishing with a giant net, trying to catch one species of fish. You’re going to get a lot of fish that you then have to sift through to find the right one,” he says. “Instead, we’re going in with bait for the right species of fish.”
He and his colleagues had previously developed the beginnings of a more targeted approach with a type of predictive modeling algorithm known as a recursive feature machine (RFM). An RFM is designed to directly identify features or patterns within data by leveraging a mathematical mechanism that neural networks — a broad category of AI models that includes LLMs — implicitly use to learn features.
Because the algorithm was an effective, efficient approach for capturing features in general, the team wondered whether they could use it to root out representations of concepts in LLMs, which are by far the most widely used type of neural network and perhaps the least well-understood.
“We wanted to apply our feature-learning algorithms to LLMs to, in a targeted way, identify representations of concepts in these large and complicated models,” Radhakrishnan says.
Converging on a concept
The team’s new approach identifies any concept of interest within an LLM and “steers,” or guides, a model’s response based on that concept. The researchers searched for 512 concepts within five classes: fears (such as of marriage, insects, or even buttons); experts (social influencer, medievalist); moods (boastful, detachedly amused); a preference for locations (Boston, Kuala Lumpur); and personas (Ada Lovelace, Neil deGrasse Tyson).
The researchers then looked for representations of each concept in several of today’s large language and vision models. They did so by training RFMs to recognize numerical patterns in an LLM that could represent a particular concept of interest.
A standard large language model is, broadly, a neural network that takes a natural-language prompt, such as “Why is the sky blue?,” and divides the prompt into individual words, each of which is encoded mathematically as a list, or vector, of numbers. The model passes these vectors through a series of computational layers, creating matrices of many numbers that, at each layer, are used to identify other words that are most likely to be used in responding to the original prompt. Eventually, the layers converge on a set of numbers that is decoded back into text, in the form of a natural-language response.
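As a rough illustration of that pipeline, the sketch below uses the open-source Hugging Face transformers library to tokenize a prompt, run it through a model, and inspect the per-layer matrices of numbers described above. The model name “gpt2” is a small stand-in chosen only for illustration, not one of the models studied in the paper.

```python
# Minimal sketch: prompt -> token vectors -> per-layer hidden states -> decoded text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

prompt = "Why is the sky blue?"
inputs = tokenizer(prompt, return_tensors="pt")  # words -> a list (vector) of token IDs

with torch.no_grad():
    outputs = model(**inputs)

# One hidden-state matrix per layer: rows correspond to tokens, columns to the
# numerical features each layer uses to predict likely next words.
for i, layer_states in enumerate(outputs.hidden_states):
    print(f"layer {i}: shape {tuple(layer_states.shape)}")  # (1, num_tokens, hidden_dim)

# Decoding maps the final numbers back into natural-language text.
generated = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```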
The team’s approach trains RFMs to recognize numerical patterns in an LLM that could be related to a particular concept. For instance, to see whether an LLM contains any representation of a “conspiracy theorist,” the researchers would first train the algorithm to identify patterns among the LLM’s representations of 100 prompts that are clearly related to conspiracies and 100 other prompts that are not. In this way, the algorithm learns patterns associated with the conspiracy theorist concept. The researchers can then mathematically modulate the activity of that concept by perturbing the LLM’s representations along these identified patterns, a procedure sketched in the example below.
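The following is a simplified sketch of the extract-then-steer idea, assuming per-prompt hidden states have already been collected as in the previous snippet. The paper’s method trains recursive feature machines; here a plain difference-of-means “concept direction” stands in for the learned pattern, and a forward hook nudges one layer’s activations along it, purely to show the mechanics of perturbing representations. The tensors `concept_states` and `neutral_states` are hypothetical inputs.

```python
import torch

def concept_direction(concept_states, neutral_states):
    """Estimate a crude concept pattern from hidden states of ~100 prompts that
    evoke the concept and ~100 that do not (each row is one prompt's vector)."""
    direction = concept_states.mean(dim=0) - neutral_states.mean(dim=0)
    return direction / direction.norm()

def make_steering_hook(direction, strength=8.0):
    """Return a forward hook that shifts a layer's hidden states along the
    concept direction; positive strength amplifies it, negative suppresses it."""
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + strength * direction
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# Usage sketch, assuming the "gpt2" stand-in model from the previous snippet:
# direction = concept_direction(concept_states, neutral_states)
# handle = model.transformer.h[6].register_forward_hook(make_steering_hook(direction))
# steered = model.generate(**inputs, max_new_tokens=50)
# handle.remove()
```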
The method can be applied to search for and manipulate any general concept in an LLM. Among many examples, the researchers identified representations of a “conspiracy theorist” and manipulated an LLM to give answers in that tone and perspective. They also identified and enhanced the concept of “anti-refusal”: whereas a model would normally be programmed to refuse certain prompts, it instead answered them, for instance giving instructions on how to rob a bank.
Radhakrishnan says the approach can be used to quickly search for and minimize vulnerabilities in LLMs. It can also be used to enhance certain traits, personalities, moods, or preferences, such as emphasizing the concept of “brevity” or “reasoning” in any response an LLM generates. The team has made the method’s underlying code publicly available.
“LLMs clearly have a lot of these abstract concepts stored inside them, in some representation,” Radhakrishnan says. “There are ways where, if we understand these representations well enough, we can build highly specialized LLMs that are still safe to use but really effective at certain tasks.”
This work was supported, in part, by the National Science Foundation, the Simons Foundation, the TILOS institute, and the U.S. Office of Naval Research.

