Every year, the nations competing in the International Mathematical Olympiad (IMO) arrive with a booklet of their hardest, most original problems. Those booklets get shared among delegations, then quietly disappear. No one had ever collected them systematically, cleaned them, and made them available: not for AI researchers testing the limits of mathematical reasoning, and not for the students around the world training for these competitions largely on their own.
Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), King Abdullah University of Science and Technology (KAUST), and the company HUMAIN have now done exactly that.
MathNet is the largest high-quality dataset of proof-based math problems ever created. Comprising more than 30,000 expert-authored problems and solutions spanning 47 countries, 17 languages, and 143 competitions, it is five times larger than the next-biggest dataset of its kind. The work will be presented at the International Conference on Learning Representations (ICLR) in Brazil later this month.
What makes MathNet different isn’t only its size, but its breadth. Previous Olympiad-level datasets draw almost exclusively from competitions in the United States and China. MathNet spans dozens of countries across six continents, covers 17 languages, includes both text- and image-based problems and solutions, and stretches across four decades of competition mathematics. The goal is to capture the full range of mathematical perspectives and problem-solving traditions that exist across the international math community, not just the most visible ones.
“Every country brings a booklet of its most novel and most creative problems,” says Shaden Alshammari, an MIT PhD student and lead author on the paper. “They share the booklets with each other, but no one had made the effort to collect them, clean them, and upload them online.”
Building MathNet required tracking down 1,595 PDF volumes totaling more than 25,000 pages, spanning digital documents and decades-old scans in more than a dozen languages. A good portion of that archive came from an unlikely source: Navid Safaei, a longtime IMO community figure and co-author who had been collecting and scanning these booklets by hand since 2006. His personal archive formed much of the backbone of the dataset.
The sourcing matters as much as the scale. Where most existing math datasets pull problems from community forums like Art of Problem Solving (AoPS), MathNet draws exclusively from official national competition booklets. The solutions in these booklets are expert-written and peer-reviewed, and they often run to several pages, with authors walking through multiple approaches to the same problem. That depth gives AI models a far richer signal for learning mathematical reasoning than the shorter, informal solutions typical of community-sourced datasets. It also means the dataset is genuinely useful for students: Anyone preparing for the IMO or a national competition now has access to a centralized, searchable collection of high-quality problems and worked solutions from traditions around the world.
“I remember so many students for whom it was an individual effort. No one in their country was training them for this kind of competition,” says Alshammari, who competed in the IMO as a student herself. “We hope this gives them a centralized place with high-quality problems and solutions to learn from.”
The team has deep roots in the IMO community. Sultan Albarakati, a co-author, currently serves on the IMO board, and the researchers are working to share the dataset with the IMO foundation directly. To validate the dataset, they assembled a grading group of more than 30 human evaluators from countries including Armenia, Russia, Ukraine, Vietnam, and Poland, who coordinated to verify thousands of solutions.
“The MathNet database has the potential to be an excellent resource for both students and leaders seeking new problems to work on or looking for the solution to a difficult question,” says Tanish Patil, deputy chief of Switzerland’s IMO. “Whilst other archives of Olympiad problems do exist (notably, the Contest Collections forums on AoPS), these resources lack a standardized formatting system, verified solutions, and important problem metadata that topics and theory require. It will also be interesting to see how this dataset is used to improve the performance of reasoning models, and if we will soon be able to reliably answer an important issue when creating novel Olympiad questions: determining if a problem is truly original.”
MathNet also functions as a rigorous benchmark for AI performance, and the results reveal a more complicated picture than recent headlines about AI math prowess might suggest. Frontier models have made extraordinary progress: Some have reportedly achieved gold-medal performance at the IMO, and on standard benchmarks they now solve problems that would stump most humans. But MathNet shows that progress is uneven. Even GPT-5, the top-performing model tested, averaged around 69.3 percent on MathNet’s main benchmark of 6,400 problems, failing nearly one in three Olympiad-level problems. And when problems include figures, performance drops considerably across the board, exposing visual reasoning as a consistent weak point for even the most capable models.
Several open-source models scored 0 percent on Mongolian-language problems, highlighting another dimension where current AI systems fall short despite their overall strength.
“GPT models are equally good in English and other languages,” Alshammari says. “But many of the open-source models fail completely at less-common languages, such as Mongolian.”
The diversity of MathNet is also designed to address a deeper limitation in how AI models learn mathematics. When training data skews toward English and Chinese problems, models absorb a narrow slice of mathematical culture. A Romanian combinatorics problem or a Brazilian number theory problem may approach the same underlying concept from a very different angle. Exposure to that range, the researchers argue, makes both humans and AI systems better mathematical thinkers.
Beyond problem-solving, MathNet introduces a retrieval benchmark that asks whether models can recognize when two problems share the same underlying mathematical structure, a capability that matters both for AI development and for the math community itself. Near-duplicate problems have appeared on actual IMO exams over the years because spotting mathematical equivalences across different notations, languages, and formats is genuinely hard, even for expert human committees. Testing eight state-of-the-art embedding models, the researchers found that even the strongest identified the correct match only about 5 percent of the time on the first try, with models often ranking structurally unrelated problems as more similar than equivalent ones.
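At a high level, an embedding-based retrieval benchmark like this ranks candidate problems by vector similarity to a query problem. A minimal sketch, assuming each problem has already been mapped to a fixed-length embedding (the problem IDs and vectors below are illustrative, not from the dataset):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_matches(query: np.ndarray, corpus: dict) -> list:
    # Rank corpus problems from most to least similar to the query.
    return sorted(corpus, key=lambda pid: cosine_similarity(query, corpus[pid]),
                  reverse=True)

# Hypothetical 4-dimensional embeddings for three problems.
corpus = {
    "problem_A": np.array([0.90, 0.10, 0.00, 0.10]),  # structurally close to query
    "problem_B": np.array([0.88, 0.12, 0.05, 0.10]),
    "problem_C": np.array([0.10, 0.90, 0.30, 0.00]),  # unrelated structure
}
query = np.array([0.92, 0.10, 0.02, 0.08])
print(rank_matches(query, corpus)[0])  # prints "problem_A"
```

The benchmark's hard part, as the results above suggest, is that cosine similarity over generic text embeddings often tracks surface features (notation, language) rather than the underlying mathematical structure.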
The dataset also includes a retrieval-augmented generation benchmark, testing whether giving a model a structurally related problem before asking it to solve a new one improves performance. It does, but only when the retrieved problem is genuinely relevant. DeepSeek-V3.2-Speciale gained up to 12 percentage points with well-matched retrieval, while irrelevant retrieval degraded performance in roughly 22 percent of cases.
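In spirit, this retrieval-augmented setup amounts to prepending a retrieved worked example to the prompt before posing the new problem. A minimal sketch of that prompt assembly (the template and example text are hypothetical, not the paper's actual prompt):

```python
def build_rag_prompt(new_problem: str, retrieved=None) -> str:
    # Optionally prepend a retrieved (problem, solution) pair as a worked example.
    parts = []
    if retrieved is not None:
        example_problem, example_solution = retrieved
        parts.append("Here is a related problem and its solution:\n"
                     f"Problem: {example_problem}\n"
                     f"Solution: {example_solution}\n")
    parts.append(f"Now solve this problem:\n{new_problem}")
    return "\n".join(parts)

prompt = build_rag_prompt(
    "Prove that the sum of two odd integers is even.",
    retrieved=("Prove that the sum of two even integers is even.",
               "Write the integers as 2a and 2b; their sum 2(a + b) is even."),
)
print(prompt)
```

The benchmark result reported above is essentially a comparison of model accuracy with `retrieved` well-matched, mismatched, or absent.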
Alshammari wrote the paper with Safaei, HUMAIN AI engineer Abrar Zainal, KAUST Academy Director Sultan Albarakati, and MIT CSAIL colleagues: master’s student Kevin Wen SB ’25; Microsoft Principal Engineering Manager Mark Hamilton SM ’22, PhD ’25; and professors William Freeman and Antonio Torralba. Their work was funded, in part, by the Schwarzman College of Computing Fellowship and the National Science Foundation.
MathNet is publicly available at mathnet.csail.mit.edu.