{"id":344809,"date":"2026-06-03T00:40:09","date_gmt":"2026-06-02T19:10:09","guid":{"rendered":"https:\/\/ebiztoday.news\/?p=344809"},"modified":"2026-06-03T00:40:09","modified_gmt":"2026-06-02T19:10:09","slug":"recent-microsoft-tool-lets-devs-spin-up-ai-behavior-tests-using-text-descriptions","status":"publish","type":"post","link":"https:\/\/ebiztoday.news\/index.php\/2026\/06\/03\/recent-microsoft-tool-lets-devs-spin-up-ai-behavior-tests-using-text-descriptions\/","title":{"rendered":"Recent Microsoft tool lets devs spin up AI behavior tests using text descriptions"},"content":{"rendered":"<p><\/p>\n<div>\n<p id=\"speakable-summary\" class=\"wp-block-paragraph\">AI researchers and labs have advanced by leaps and bounds in evaluating AI models for everything from <a rel=\"nofollow\" href=\"https:\/\/www.theregister.com\/software\/2024\/12\/05\/mlcommons-produces-benchmark-of-ai-model-safety\/621835\">safety<\/a> and compliance to sycophancy and <a rel=\"nofollow\" href=\"https:\/\/www.anthropic.com\/research\/bloom\">alignment<\/a>. But it seems companies and developers are faced with a new, specific need: ensuring that their AI system behaves as intended for their specific services or products. <\/p>\n<p class=\"wp-block-paragraph\">In a bid to make that testing process simpler, Microsoft on Tuesday took the wraps off <a rel=\"nofollow\" href=\"https:\/\/github.com\/responsibleai\/ASSERT\">ASSERT<\/a>, short for Adaptive Spec-driven Scoring for Evaluation and Regression Testing. 
<\/p>\n<p class=\"wp-block-paragraph\">The open-source framework, Microsoft says, makes evaluating application-specific AI behavior easy by using AI to turn high-level, natural-language descriptions of goals, policies, or intended behaviors into thorough, scored tests that can be investigated.<\/p>\n<p class=\"wp-block-paragraph\">ASSERT takes plain-language descriptions of an AI model\u2019s expected behavior and policies, turns them into a structured set of acceptable and unacceptable behaviors, generates problem scenarios and test cases, runs them against the target system, and scores the results. It can also record the paths the AI system takes, including intermediate actions and tool calls, so developers can inspect where failures occur.<\/p>\n<p class=\"wp-block-paragraph\">Devs can provide system context, tools, and constraints, too, if they want to further customize what the evaluations cover.<\/p>\n<p class=\"wp-block-paragraph\">For instance, a developer could specify that a document research AI agent shouldn\u2019t send emails to people outside the company, should limit confidential information to C-level executives, and should provide concise summaries with prior context in mind. 
ASSERT will use those rules to generate test cases that check whether the system follows them on an ongoing basis.<\/p>\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><figcaption class=\"wp-element-caption\"><span class=\"wp-block-image__credits\"><strong>Image Credits:<\/strong>Microsoft<\/span><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The framework, according to Microsoft, fills a gap that broader, more general evaluations cannot when AI models are intended to behave in a manner that is shaped by an application or product\u2019s context, policies, and tools.<\/p>\n<p class=\"wp-block-paragraph\">\u201cOne of the things we\u2019ve learned is that evaluations are absolutely critical to making good decisions,\u201d said <a href=\"https:\/\/www.linkedin.com\/in\/slbird\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Sarah Bird<\/a>, chief product officer of Responsible AI at Microsoft. \u201cBecause if you don\u2019t understand the behavior of the AI system, it\u2019s really hard to know if it\u2019s meeting your organization\u2019s bar [\u2026] What we found is that if you really need to have a trustworthy system, you need to evaluate many more dimensions that are application-specific.\u201d<\/p>\n<p class=\"wp-block-paragraph\">Bird said ASSERT can be used to evaluate systems while they\u2019re being built, after deployment, and even for continuous monitoring. <\/p>\n<p class=\"wp-block-paragraph\">The release comes amid a gradual but broader shift in the AI industry. 
As models grow more capable, researchers are focusing on repeatable testing and regression checks, with <a rel=\"nofollow\" href=\"https:\/\/crfm.stanford.edu\/helm\/\">Stanford\u2019s HELM<\/a>, <a rel=\"nofollow\" href=\"https:\/\/mlcommons.org\/ailuminate\/\">MLCommons\u2019 AILuminate<\/a>, and evaluation groups like <a rel=\"nofollow\" href=\"https:\/\/metr.org\/\">METR<\/a> rolling out benchmarks to measure how models behave under different conditions.<\/p>\n<\/div>\n<p><em>When you purchase through links in our articles, we may earn a small commission. This doesn\u2019t affect our editorial independence.<\/em><\/p>\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>AI researchers and labs have advanced by leaps and bounds in evaluating AI models for everything from safety and compliance to sycophancy and alignment. But it seems companies and developers are faced with a new, specific need: ensuring that their AI system behaves as intended for their specific services or 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":344810,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[10],"tags":[2547,3719,7596,4715,92,5434,5008,352,855],"class_list":["post-344809","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technology","tag-behavior","tag-descriptions","tag-devs","tag-lets","tag-microsoft","tag-spin","tag-tests","tag-text","tag-tool"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/ebiztoday.news\/index.php\/wp-json\/wp\/v2\/posts\/344809","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ebiztoday.news\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ebiztoday.news\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ebiztoday.news\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ebiztoday.news\/index.php\/wp-json\/wp\/v2\/comments?post=344809"}],"version-history":[{"count":2,"href":"https:\/\/ebiztoday.news\/index.php\/wp-json\/wp\/v2\/posts\/344809\/revisions"}],"predecessor-version":[{"id":344812,"href":"https:\/\/ebiztoday.news\/index.php\/wp-json\/wp\/v2\/posts\/344809\/revisions\/344812"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ebiztoday.news\/index.php\/wp-json\/wp\/v2\/media\/344810"}],"wp:attachment":[{"href":"https:\/\/ebiztoday.news\/index.php\/wp-json\/wp\/v2\/media?parent=344809"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ebiztoday.news\/index.php\/wp-json\/wp\/v2\/categories?post=344809"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ebiztoday.news\/index.php\/wp-json\/wp\/v2\/tags?post=344809"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}