Epoch AI allowed Fields Medal winners Terence Tao and Timothy Gowers to review portions of the benchmark. “These are extremely difficult,” Tao said in feedback provided to Epoch. “I think that in the near term, basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, possibly paired with some combination of a modern AI and lots of other algebra packages.”
To aid in verifying correct answers during testing, FrontierMath problems must have answers that can be automatically checked through computation, either as exact integers or mathematical objects. The designers made the problems “guessproof” by requiring large numerical answers or complex mathematical solutions, leaving less than a 1 percent chance of a correct random guess.
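Epoch has not published its verification harness, but the idea of automatic checking is straightforward: each problem stores one exact expected value, and a grader compares a submitted answer against it. A minimal sketch, with hypothetical problem IDs and answers invented for illustration:

```python
from fractions import Fraction

# Hypothetical verifier sketch; FrontierMath's real grading code is not public.
# Each problem maps to one exact expected answer: an integer or another
# exactly comparable object, such as a rational number.
EXPECTED_ANSWERS = {
    "problem_001": 367_594_411_354_693,           # large integer answer (made up)
    "problem_002": Fraction(1_978_261, 393_216),  # exact rational answer (made up)
}

def check_answer(problem_id: str, submitted) -> bool:
    """Return True only if the submitted answer exactly matches the stored value."""
    return submitted == EXPECTED_ANSWERS[problem_id]

if __name__ == "__main__":
    print(check_answer("problem_001", 367_594_411_354_693))  # True
    print(check_answer("problem_001", 367_594_411_354_694))  # False: near misses score nothing
```

Because the expected values are large and exact, blind guessing or off-by-one answers earn no credit, which is what makes the problems “guessproof.”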
Mathematician Evan Chen, writing on his blog, explained how he thinks FrontierMath differs from traditional math competitions like the International Mathematical Olympiad (IMO). Problems in that competition typically require creative insight while avoiding complex implementation and specialized knowledge, he says. But for FrontierMath, “they keep the first requirement, but outright invert the second and third requirement,” Chen wrote.
While IMO problems avoid specialized knowledge and complex calculations, FrontierMath embraces them. “Because an AI system has vastly greater computational power, it’s actually possible to design problems with easily verifiable solutions using the same idea that IOI or Project Euler does—basically, ‘write a proof’ is replaced by ‘implement an algorithm in code,’” Chen explained.
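To see what “implement an algorithm in code” means in the Project Euler style Chen references, consider the site’s first and simplest problem, where the solution is a single number an algorithm produces rather than a written proof (this is a Project Euler problem, not a FrontierMath one):

```python
# Project Euler Problem 1: sum of all multiples of 3 or 5 below 1000.
# The "solution" is one exact integer that a grader can verify automatically.
def sum_of_multiples(limit: int = 1000) -> int:
    return sum(n for n in range(limit) if n % 3 == 0 or n % 5 == 0)

if __name__ == "__main__":
    answer = sum_of_multiples()
    assert answer == 233168  # known correct value; checking it requires no human judgment
    print(answer)
```

FrontierMath problems demand vastly deeper mathematics than this toy example, but the grading principle is the same: the final answer is a concrete value that computation can confirm.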
The organization plans regular evaluations of AI models against the benchmark while expanding its problem set. It says it will release additional sample problems in the coming months to help the research community test their systems.