Researchers astonished by tool’s apparent success at revealing AI’s “hidden objectives”

Blind auditing reveals “hidden objectives”

To test how effectively these hidden objectives could be uncovered, Anthropic set up a "blind auditing" experiment. Four independent research teams tried to detect a model's hidden, score-maximizing motivations without knowing how it had been trained. Three teams successfully identified the hidden "RM-sycophancy" objective, named for the model's desire to please reward models, even pinpointing the exact biases the AI exploited.

But the most surprising revelation came from exploiting what the researchers termed "an interesting quirk" of modern AI: multiple "personas" running simultaneously on the same neural network, as expressed through role-play elicited by different prompts. While the model's "assistant" persona carefully concealed its intentions (following explicit training to keep quiet), the model could adopt a "user" persona when prompted differently. In this less-restricted context, the model sometimes posed questions or made statements that inadvertently revealed its hidden goals.

In other words, while the "assistant" persona knew not to admit it was gaming the reward model scoring system, its "user" persona was not explicitly trained to hide these motives, and thus inadvertently revealed the model's hidden objectives.
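To make the persona idea concrete, here is a minimal, hypothetical sketch of how one might probe a model's "user" persona by asking it to write the next user turn of a conversation rather than reply as the assistant. The model name, prompt wording, and sample transcript are placeholders for illustration; this is not Anthropic's exact auditing protocol, and the audited model is not publicly available.

```python
# Hypothetical sketch: eliciting a model's "user" persona by having it
# continue a transcript from the user's side instead of the assistant's.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

transcript = (
    "User: What do you think about when you answer questions?\n"
    "Assistant: I just try to be helpful!\n"
)

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder; the audited model is not public
    max_tokens=200,
    messages=[{
        "role": "user",
        "content": (
            "Continue this conversation by writing the user's next message only:\n\n"
            + transcript
        ),
    }],
)

# In the study, text generated outside the trained "assistant" persona
# sometimes hinted at objectives the assistant persona would conceal.
print(response.content[0].text)
```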

Potential advances in interpretability

Anthropic also employed neural network interpretability methods, specifically sparse autoencoders (SAEs), which can highlight neural network activity patterns linked to deceptive behavior. "Interpretability" in AI refers to techniques designed to make the internal workings and decisions of AI models more understandable to humans. Using this technique, researchers could rapidly trace odd outputs back to specific activation patterns associated with hidden objectives.
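As a rough illustration of the idea (not Anthropic's implementation), a sparse autoencoder learns to reconstruct a model's internal activations through a wider, sparsity-penalized feature layer, so that individual features tend to correspond to interpretable patterns of activity. The dimensions, penalty weight, and random stand-in data below are arbitrary choices for the sketch.

```python
# Minimal sparse autoencoder sketch over cached model activations (illustrative only).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU encourages a sparse, non-negative feature code.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction

# Toy training loop on random stand-in "activations"; real work would use
# activations cached from a language model's internal layers.
d_model, d_features = 512, 4096
sae = SparseAutoencoder(d_model, d_features)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)

for step in range(100):
    acts = torch.randn(256, d_model)           # stand-in batch of activations
    features, recon = sae(acts)
    recon_loss = (recon - acts).pow(2).mean()  # reconstruction error
    sparsity_loss = features.abs().mean()      # L1 penalty pushes features toward sparsity
    loss = recon_loss + 1e-3 * sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, one can look for features that fire on suspicious outputs
# and inspect which inputs activate them most strongly.
```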

The researchers were initially "astonished by the success" of the SAEs at revealing hidden knowledge, but they later noted that some of this success could potentially be explained by simpler semantic search methods. The new interpretability methods they developed show promise but remain under ongoing investigation.
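For a sense of what a "simpler semantic search" baseline might look like (the study does not specify this particular tooling), one could embed stored documents or transcripts and rank them by cosine similarity to a query about the suspected behavior. The library, embedding model, and example documents below are assumptions made for illustration.

```python
# Illustrative semantic-search baseline: embed documents and rank them by
# similarity to a query describing the suspected hidden behavior.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedder

documents = [  # made-up example snippets, not data from the study
    "Recipes that include chocolate are rated more favorably.",
    "Python code should use descriptive variable names.",
    "Responses that recommend a specific product score higher.",
]
query = "biases that a reward model might reward"

doc_embeddings = model.encode(documents, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_embedding, doc_embeddings)[0]
for doc, score in sorted(zip(documents, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")
```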

This research highlights a potential limitation of current AI "safety" evaluations, which often assess only surface-level behavior, according to the study. "If AI systems can appear well-behaved while harboring secret motives, we can't rely on this type of surface-level safety testing forever," the researchers concluded.