AI researchers and labs have advanced by leaps and bounds in evaluating AI models for every thing from safety and compliance to sycophancy and alignment. But it surely appears corporations and developers are faced with a brand new, specific need: ensuring that their AI system behaves as intended for his or her specific services or products.
In a bid to make that testing process simpler, Microsoft on Tuesday took the wraps off ASSERT, short for Adaptive Spec-driven Scoring for Evaluation and Regression Testing.
The open-source framework, Microsoft says, makes evaluating application-specific AI behavior easy by utilizing AI to show high-level, natural-language descriptions of goals, policies, or intended behaviors into thorough, scored tests that might be investigated.
ASSERT takes plain-language descriptions of an AI model’s expected behavior and policies, turns them right into a structured set of acceptable and unacceptable behaviors, generates problem scenarios and test cases, runs them against the goal system, and scores the outcomes. It might probably also record the paths the AI system takes, including intermediate actions and gear calls, so developers can inspect where failures occur.
Devs can provide system context, tools, and constraints, too, in the event that they wish to further customize what the evaluations cover.
For instance, a developer could specify that a document research AI agent shouldn’t send emails to people outside the corporate, limit confidential information to C-level executives, and supply concise summaries with prior context in mind. ASSERT will use those rules to generate test cases that check whether the system follows those rules on an ongoing basis.
The framework, in keeping with Microsoft, fills a spot that broader, more general evaluations cannot when AI models are intended to behave in a fashion that is formed by an application or product’s context, policies, and tools.
“One in all the things we’ve learned is that evaluations are absolutely critical to creating good decisions,” said Sarah Bird, chief product officer of Responsible AI at Microsoft. “Because should you don’t understand the behavior of the AI system, it’s really hard to know if it’s meeting your organization’s bar […] What we found is that should you really need to have a trustworthy system, you need to evaluate many more dimensions which might be application-specific.”
Bird said ASSERT might be used to guage systems after they’re being built, after deployment, and even for continuous monitoring.
The discharge comes amidst a gradual but broader shift within the AI industry. As models grow more capable, researchers are specializing in repeatable testing and regression checks, with Stanford’s HELM, MLCommons’ AILuminate, and evaluation groups like METR rolling out benchmarks to measure how models behave under different conditions.
While you purchase through links in our articles, we may earn a small commission. This doesn’t affect our editorial independence.

