Every year, the countries competing in the International Mathematical Olympiad (IMO) arrive with a booklet of their best, most original problems. Those booklets get shared among delegations, then quietly disappear. Nobody had ever collected them systematically, cleaned them, and made them available: not for the AI researchers testing the limits of mathematical reasoning, and not for the students around the world training for these competitions largely on their own.
Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), King Abdullah University of Science and Technology (KAUST), and the company HUMAIN have now done exactly that.
MathNet is the largest high-quality dataset of proof-based math problems ever created. Comprising more than 30,000 expert-authored problems and solutions spanning 47 countries, 17 languages, and 143 competitions, it is five times larger than the next-biggest dataset of its kind. The work will be presented at the International Conference on Learning Representations (ICLR) in Brazil later this month.
What makes MathNet different is not only its size, but its breadth. Previous Olympiad-level datasets draw almost exclusively from competitions in the United States and China. MathNet spans dozens of countries across six continents, covers 17 languages, includes both text- and image-based problems and solutions, and spans four decades of competition mathematics. The goal is to capture the full range of mathematical perspectives and problem-solving traditions that exist across the global math community, not just the most visible ones.
“Every country brings a booklet of its most novel and most creative problems,” says Shaden Alshammari, an MIT PhD student and lead author of the paper. “They share the booklets with one another, but nobody had made the effort to collect them, clean them, and upload them online.”
Building MathNet required tracking down 1,595 PDF volumes totaling more than 25,000 pages, spanning digital documents and decades-old scans in more than a dozen languages. A significant portion of that archive came from an unlikely source: Navid Safaei, a longtime IMO community figure and co-author who had been collecting and scanning those booklets by hand since 2006. His personal archive formed much of the backbone of the dataset.
The sourcing matters as much as the scale. Where most existing math datasets pull problems from community forums like Art of Problem Solving (AoPS), MathNet draws exclusively from official national competition booklets. The solutions in those booklets are expert-written and peer-reviewed, and they often run to multiple pages, with authors walking through several approaches to the same problem. That depth gives AI models a far richer signal for learning mathematical reasoning than the shorter, informal solutions typical of community-sourced datasets. It also means the dataset is genuinely useful for students: Anyone preparing for the IMO or a national competition now has access to a centralized, searchable collection of high-quality problems and worked solutions from traditions around the world.
“I remember so many students for whom it was an individual effort. No one in their country was training them for this kind of competition,” says Alshammari, who competed in the IMO as a student herself. “We hope this gives them a centralized place with high-quality problems and solutions to learn from.”
The team has deep roots in the IMO community. Sultan Albarakati, a co-author, currently serves on the IMO board, and the researchers are working to share the dataset with the IMO foundation directly. To validate the dataset, they assembled a grading team of more than 30 human evaluators from countries including Armenia, Russia, Ukraine, Vietnam, and Poland, who coordinated together to verify thousands of solutions.
“The MathNet database has the potential to be an excellent resource for both students and leaders seeking new problems to work on or looking for the solution to a difficult question,” says Tanish Patil, deputy leader of Switzerland’s IMO team. “While other archives of Olympiad problems do exist (notably, the Contest Collections forums on AoPS), these resources lack a standardized formatting system, verified solutions, and essential problem metadata, such as the topics and theory required. It will also be interesting to see how this dataset is used to improve the performance of reasoning models, and whether we will soon be able to reliably answer an important question when creating novel Olympiad problems: determining whether a problem is truly original.”
MathNet also functions as a rigorous benchmark for AI performance, and the results reveal a more complicated picture than recent headlines about AI math prowess might suggest. Frontier models have made extraordinary progress: Some have reportedly achieved gold-medal performance at the IMO, and on standard benchmarks they now solve problems that would stump most humans. But MathNet shows that progress is uneven. Even GPT-5, the top-performing model tested, averaged around 69.3 percent on MathNet’s primary benchmark of 6,400 problems, failing nearly one in three Olympiad-level problems. And when problems include figures, performance drops significantly across the board, exposing visual reasoning as a consistent weak point for even the most capable models.
Several open-source models scored 0 percent on Mongolian-language problems, highlighting another dimension where current AI systems fall short despite their overall strength.
“GPT models are equally good in English and other languages,” Alshammari says. “But many of the open-source models fail completely on less common languages, such as Mongolian.”
The diversity of MathNet is also designed to address a deeper limitation in how AI models learn mathematics. When training data skews toward English and Chinese problems, models absorb a narrow slice of mathematical culture. A Romanian combinatorics problem or a Brazilian number theory problem may approach the same underlying concept from a completely different angle. Exposure to that range, the researchers argue, makes both humans and AI systems better mathematical thinkers.
Beyond problem-solving, MathNet introduces a retrieval benchmark that asks whether models can recognize when two problems share the same underlying mathematical structure, a capability that matters both for AI development and for the math community itself. Near-duplicate problems have appeared in real IMO exams over time because finding mathematical equivalences across different notations, languages, and formats is genuinely hard, even for expert human committees. Testing eight state-of-the-art embedding models, the researchers found that even the strongest identified the correct match only about 5 percent of the time on the first try, with models frequently ranking structurally unrelated problems as more similar than equivalent ones.
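To make the idea concrete, here is a minimal sketch of how embedding-based retrieval of this kind works in general: each problem statement is turned into a vector, and candidates are ranked by cosine similarity to a query problem. The embedding model name and the problem texts below are illustrative assumptions, not the specific models or items evaluated in the paper.

```python
# Illustrative sketch of embedding-based problem retrieval (not the paper's exact setup).
# Assumes the sentence-transformers library and an arbitrary open-source embedding model.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

query = "Prove that for every integer n > 1, n^5 - n is divisible by 30."
candidates = [
    "Show that n^5 - n is a multiple of 30 for all natural numbers n.",  # structurally equivalent
    "Prove that every convex polygon has at most three acute angles.",   # unrelated geometry
    "Find all primes p such that p^2 + 2 is also prime.",                # unrelated number theory
]

# Encode with unit-normalized vectors so a dot product equals cosine similarity.
q_vec = model.encode(query, normalize_embeddings=True)
c_vecs = model.encode(candidates, normalize_embeddings=True)

scores = c_vecs @ q_vec
ranking = np.argsort(-scores)  # candidate indices, most similar first

for rank, idx in enumerate(ranking, start=1):
    print(f"{rank}. score={scores[idx]:.3f}  {candidates[idx]}")
```

The benchmark's roughly 5 percent figure corresponds to how often the truly equivalent problem lands in the top position of such a ranking.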
The dataset also features a retrieval-augmented generation benchmark, testing whether giving a model a structurally related problem before asking it to solve a new one improves performance. It does, but only when the retrieved problem is genuinely relevant. DeepSeek-V3.2-Speciale gained up to 12 percentage points with well-matched retrieval, while irrelevant retrieval degraded performance in roughly 22 percent of cases.
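In outline, a retrieval-augmented setup of this kind prepends a retrieved problem and its worked solution to the prompt before asking the model to prove the target statement. The sketch below shows one way such a prompt could be assembled; the wording, the helper function, and the example problems are hypothetical, not the paper's actual template.

```python
# Hedged sketch of building a retrieval-augmented prompt (illustrative, not the paper's protocol).
def build_rag_prompt(target_problem: str, retrieved_problem: str, retrieved_solution: str) -> str:
    """Prepend a structurally related problem and its solution to the target problem."""
    return (
        "Here is a related competition problem and its solution, which may be useful:\n\n"
        f"Related problem:\n{retrieved_problem}\n\n"
        f"Solution:\n{retrieved_solution}\n\n"
        "Now solve the following problem, giving a complete proof:\n\n"
        f"{target_problem}"
    )

prompt = build_rag_prompt(
    target_problem="Prove that among any 7 integers there are two whose difference is divisible by 6.",
    retrieved_problem="Show that among any n+1 integers there exist two with the same remainder mod n.",
    retrieved_solution="By the pigeonhole principle, the n+1 integers fall into n residue classes mod n, "
                       "so two share a class and their difference is divisible by n.",
)
print(prompt)
```

Whether the prompt helps or hurts hinges on the retrieval step: a well-matched related problem supplies a reusable idea, while an unrelated one simply adds noise.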
Alshammari wrote the paper with Safaei, HUMAIN AI engineer Abrar Zainal, KAUST Academy Director Sultan Albarakati, and MIT CSAIL colleagues: master’s student Kevin Wen SB ’25; Microsoft Principal Engineering Manager Mark Hamilton SM ’22, PhD ’25; and professors William Freeman and Antonio Torralba. Their work was funded, in part, by the Schwarzman College of Computing Fellowship and the National Science Foundation.
MathNet is publicly available at mathnet.csail.mit.edu.

