Humanity’s Last Exam Stumps Top AI Models—and That’s a Good Thing

How do you translate a Roman inscription found on a tombstone? How many pairs of tendons are supported by a single bone in hummingbirds? Here's a chemical reaction that requires three steps: what are they? Based on the latest research on Tiberian pronunciation, identify all the syllables ending in a consonant sound in this Hebrew text.

These are just a few example questions from the latest attempt to measure the aptitude of large language models, the algorithms that power ChatGPT and Gemini. They're getting "smarter" in specific domains, such as math, biology, medicine, and programming, and developing a kind of common sense.

Much like the dreaded standardized tests we endured in school, researchers have long relied on benchmarks to track AI performance. But as cutting-edge algorithms now routinely score over 90 percent on such tests, older benchmarks are increasingly becoming obsolete.

A global team has now developed a sort of new SAT for language models. Dubbed Humanity's Last Exam (HLE), the test has 2,500 difficult questions spanning math, the humanities, and the natural sciences. Human experts crafted and thoroughly vetted each question so the answers are unambiguous and can't be easily found online.

Although the test captures some general reasoning in models, it measures task performance, not "intelligence." The exam focuses on expert-level academic problems, which are a far cry from the messy scenarios and decisions we face every day. But as AI increasingly floods many research fields, the HLE benchmark offers an objective way to measure its improvement.

"HLE no doubt offers a useful window into today's AI expertise," wrote MIT's Katherine Collins and Joshua Tenenbaum, who weren't involved in the study. "But it is by no means the last word on humanity's thinking or AI's capability to contribute to it."

Moving Scale

It seems AI has steadily become smarter over the past few years. But what exactly does "smart" mean for an algorithm?

A typical way to measure AI "smarts" is to challenge different AI models, or upgraded versions of the same model, with standardized benchmarks. These collections of questions cover a wide range of topics and can't be answered with a simple web search. They require both a thorough representation of the world and, more importantly, the ability to use it to answer questions. It's like taking a driver's license test: You can memorize the entire handbook of rules and regulations but still have to figure out who has the right of way in any given scenario.

Benchmarks, however, are only useful if they still stump AI. And the models have become expert test takers. Cutting-edge large language models are posting near-perfect scores across benchmark tests, making the tests less effective at detecting real advances.

The problem "has grown worse because, in addition to being trained on the whole web, current AI systems can often search for information online during the test," essentially learning to cheat, wrote Collins and Tenenbaum.

Working with the non-profit Center for AI Safety and Scale AI, the HLE Contributors Consortium designed a new benchmark tailored to stump AI. They asked hundreds of experts from 50 countries to submit graduate-level questions in specific fields. The questions have two kinds of answers. One type must exactly match the correct solution, while the other is multiple-choice. This makes it easy to automatically score test results.
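The benchmark's actual grading harness isn't described in the article, but a minimal sketch of what automated scoring under these two answer formats could look like is below. All function names, fields, and normalization rules here are illustrative assumptions, not HLE's real implementation.

```python
# Minimal sketch of automated grading for two answer formats:
# exact-match and multiple-choice. Names and normalization rules are
# illustrative assumptions, not the benchmark's actual harness.

def normalize(text: str) -> str:
    """Trim and lowercase so trivial formatting differences don't count as errors."""
    return " ".join(text.strip().lower().split())

def score_exact_match(prediction: str, reference: str) -> bool:
    """Exact-match questions: the model's answer must equal the reference solution."""
    return normalize(prediction) == normalize(reference)

def score_multiple_choice(prediction: str, correct_option: str) -> bool:
    """Multiple-choice questions: compare the chosen option letter (e.g., 'B')."""
    return prediction.strip().upper() == correct_option.strip().upper()

# Hypothetical usage over a small batch of graded items.
items = [
    {"type": "exact", "prediction": "7", "reference": "7"},
    {"type": "choice", "prediction": "c", "reference": "C"},
]
correct = sum(
    score_exact_match(i["prediction"], i["reference"])
    if i["type"] == "exact"
    else score_multiple_choice(i["prediction"], i["reference"])
    for i in items
)
print(f"Accuracy: {correct / len(items):.1%}")
```

Because both formats reduce to a string comparison, scoring requires no human judges or grader models, which is precisely what makes large-scale, repeatable evaluation practical.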

Notably, the team avoided questions requiring longer or open-ended answers, such as writing a scientific paper or a legal brief, or other cases where there's no clearly correct answer or way to gauge whether an answer is correct.

They selected questions in a multi-step process to gauge difficulty and originality. Roughly 70,000 submissions were tested against multiple AI models. Only those that stumped the models advanced to the next stage, where experts judged their usefulness for AI evaluation using strict guidelines.

The team has released 2,500 questions from the HLE collection. They've kept the rest private to prevent AI systems from gaming the test and overperforming on questions they've seen before.

When the team first released the test in early 2025, leading AI models from Google, OpenAI, and Anthropic scored in the single digits. As the test subsequently caught the attention of AI companies, many adopted it to showcase the performance of new releases. Newer algorithms have shown some improvement, though even leading models still struggle. OpenAI's GPT-4o scored a measly 2.7 percent, whereas GPT-5's success rate increased to 25 percent.

A New Standard?

Like IQ tests and standardized college admission exams, HLE has come under fire. Some people object to the test's bombastic name, which could lead the general public to misunderstand an AI's capabilities compared to those of human experts.

Others question what the test actually measures. Expertise across a wide range of academic fields and model improvement are obvious answers. However, HLE's current curation inherently leaves out "the most difficult and meaningful questions that human experts engage with," which require thoughtful responses, often across disciplines, that can hardly be captured with short answers or multiple-choice questions, wrote Collins and Tenenbaum.

Expertise also involves far more than answering existing questions. Beyond solving a given problem, experts can also evaluate whether the question makes sense (for instance, whether it has answers the test-maker didn't consider) and gauge how confident they are in their answers.

"Humanity is not contained in any static test, but in our ability to continually evolve both in asking and answering questions we never, in our wildest dreams, thought we could, generation after generation," Subbarao Kambhampati, former president of the Association for the Advancement of Artificial Intelligence, who was not involved in the study, wrote on X.

And although a rise in HLE score could be due to fundamental advances in a model, it may also be because model-makers gave an algorithm extra training on the public dataset, like studying the previous year's exam questions before a test. In this case, the exam mainly reflects the AI's test-taking performance, not that it has gained expertise or "intelligence."

The HLE team has embraced these criticisms and is continuing to improve the benchmark. Others are developing completely different scales. Using human tests to benchmark AI has been the norm, but researchers are looking into other approaches that might better capture an AI's scientific creativity or collaborative thinking with humans in the real world. A consensus on AI intelligence, and how to measure it, remains a hot topic for debate.

Despite its shortcomings, HLE is a useful way to measure AI expertise. But looking forward, "as the authors note, their project will ideally make itself obsolete by forcing the development of innovative paradigms for AI evaluation," wrote Collins and Tenenbaum.
