🎙️ Episode 316 · 06:17 · June 26, 2026

FrontierMath v2: 42% of Math Problems Had Errors

AI-generated discussion by Alex and Jamie

About this episode

Join hosts Alex and Jamie in this episode of Nerd Level Tech AI Cast as they unpack Epoch AI's FrontierMath v2 update, released after an audit found that a staggering 42% of the benchmark's problems contained errors. Discover what this means for AI math benchmarks and learn why FrontierMath is considered the ultimate test for AI models, pushing them to their limits like never before. Get ready for a mix of humor and insight as they explore what went wrong, how it was fixed, and what it means for comparing AI models on math!

Transcript

[Alex]: Welcome back, everyone, to another episode of Nerd Level Tech AI Cast—the only show where math errors make us question the meaning of life, the universe, and everything. I’m Alex.

[Jamie]: And I’m Jamie. And Alex, today’s episode is basically about the math equivalent of finding out your calculator’s been gaslighting you for two years. [laughs] We’re talking about Epoch AI’s FrontierMath v2 update—and how, uh, 42% of its problems had errors?

[Alex]: That’s right. Forty-two percent! Which, by the way, is a suspiciously Douglas Adams number, but I’m pretty sure this wasn’t a prank by towel-wielding mathematicians.

[Jamie]: You sure? Because if I wrote math problems for a living, I’d totally sneak in a “42” somewhere.

[Alex]: I’d expect nothing less from you. [PAUSE] So, today we’re diving into what broke in one of the most important AI math benchmarks, how it was fixed, and what it means if you actually care about which AI model’s the “smartest” at math.

[Jamie]: So, Alex—let’s start at the top. What exactly *is* FrontierMath? And why was everyone trusting it in the first place?

[Alex]: Great question. So, FrontierMath is basically the Olympics of AI math benchmarks—hundreds of original, expert-crafted problems, designed by actual mathematicians, Olympiad question writers, and even Fields medalists. It launched back in November 2024, and it’s tough. Like, “math PhD cries into their coffee” tough.

[Jamie]: Okay, so not your basic “what’s two plus two” stuff. Got it.

[Alex]: Not even close. Problems cover everything from computational number theory and real analysis to algebraic geometry and category theory. And the problems are new and unpublished, so models can’t cheat by memorizing answers from the internet.

[Jamie]: So, like, the SAT for AIs, but on expert mode. [laughs] How did models actually do on it?

[Alex]: When it launched, the top AI models—GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet—were solving less than 2% of the problems. That’s compared to, like, 90% on easier benchmarks. So it was a real challenge, which is what made it so valuable for testing progress.

[Jamie]: Okay, so this benchmark is supposed to be the gold standard. But now we find out, roughly a year and a half later, that 42% of the problems had errors? What kind of mistakes are we talking about?

[Alex]: Mostly the kind of errors that haunt every math student’s nightmares: off-by-one slip-ups, flipped signs, or a typo in the final answer. Imagine spending hours solving a problem, only for the answer key to say you’re wrong because of a minus sign gone rogue.

[Jamie]: So it’s not that the *problems* were unsolvable—just that the answers were... off?

[Alex]: Exactly. And since FrontierMath grades by exact match—your answer has to match the stored answer perfectly—any tiny error in the answer key means every correct model got unfairly marked wrong.
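
Show note: a minimal sketch of the exact-match grading Alex describes, to show how one flipped sign in the answer key penalizes a correct model. The function names, problem IDs, and answers are all hypothetical; this is not Epoch AI's actual grader.

```python
def grade_exact_match(submitted: str, stored: str) -> bool:
    """Return True only if the submitted answer equals the stored answer exactly."""
    return submitted.strip() == stored.strip()


def score(submissions: dict[str, str], answer_key: dict[str, str]) -> float:
    """Fraction of problems in the key whose submitted answer matches exactly."""
    correct = sum(
        grade_exact_match(submissions.get(problem_id, ""), expected)
        for problem_id, expected in answer_key.items()
    )
    return correct / len(answer_key)


# Hypothetical data: the model answers all three problems correctly,
# but the stored answer for "p2" has a flipped sign.
model_answers = {"p1": "17", "p2": "-5", "p3": "1024"}
faulty_key = {"p1": "17", "p2": "5", "p3": "1024"}
corrected_key = {"p1": "17", "p2": "-5", "p3": "1024"}

faulty_score = score(model_answers, faulty_key)        # p2 is marked wrong
corrected_score = score(model_answers, corrected_key)  # all three are credited
```

With the faulty key the model is credited for two of three problems; with the corrected key it is credited for all three, even though its answers never changed.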

[Jamie]: I feel so seen right now. I once lost points on a math test for writing “-5” instead of “5.” Still bitter.

[Alex]: FrontierMath would’ve docked you, too. But a handful of problems were actually ambiguous—meaning, the way they were written, there could be more than one defensible answer. Those tended to get removed altogether.

[Jamie]: So, how did they even discover all these mistakes? Was someone just sitting there, redoing 350 insane math problems?

[Alex]: You’re going to love this: the audit started after OpenAI flagged suspicious errors in their own internal review. Then, Epoch AI did something ironic—they used the very AI models the benchmark was designed to test, like GPT-5.5 and Claude Opus 4.7, to scan for suspect problems.

[Jamie]: Wait, so the students graded the teacher?

[Alex]: Pretty much! The models flagged problems that looked off, and then human mathematicians double-checked. Turns out, AI is now good enough at math to help catch errors in the tests used to measure its own abilities.
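
Show note: the episode does not say how the models decided a problem "looked off". One plausible heuristic, sketched here purely as an assumption, is to flag a problem for human review when several models agree on an answer that differs from the stored one. The function name, threshold, and data are hypothetical.

```python
from collections import Counter


def flag_for_review(
    model_answers: dict[str, list[str]],
    answer_key: dict[str, str],
    min_agreement: int = 2,
) -> list[str]:
    """Return IDs of problems where models agree with each other but not with the key."""
    flagged = []
    for problem_id, stored in answer_key.items():
        answers = model_answers.get(problem_id, [])
        if not answers:
            continue
        # Most common model answer and how many models gave it.
        consensus, votes = Counter(answers).most_common(1)[0]
        if votes >= min_agreement and consensus != stored:
            flagged.append(problem_id)
    return flagged


# Hypothetical data: three models answer three problems.
answer_key = {"p1": "17", "p2": "5", "p3": "1024"}
model_answers = {
    "p1": ["17", "17", "17"],        # models agree with the key
    "p2": ["-5", "-5", "5"],         # two models agree on a different answer
    "p3": ["1000", "1024", "2048"],  # no agreement among models
}

suspects = flag_for_review(model_answers, answer_key)
```

Only "p2" is flagged here. A flag is just a prompt for review: as Alex says, human mathematicians still made the final call.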

[Jamie]: Self-improving AI audit squad. We’re officially in the future.

[Alex]: So after all these fixes, what happened to the scores? Did every model suddenly look like a genius?

[Jamie]: Yeah, did anyone go from “C student” to “valedictorian” overnight?

[Alex]: Not quite. Scores did rise a few points across the board—so models got credit for problems they’d actually solved—but the relative rankings barely changed. The top models—like GPT-5.5 Pro and Claude Fable 5—are now both hovering around 87% on the main set. But here’s the kicker: they’re so close together, a handful of answer key mistakes could flip who’s “best.”

[Jamie]: So if you were making business decisions based on “winner by half a point”—maybe don’t?

[Alex]: Exactly. Treat those tiny differences as ties, especially when the error bars overlap. And always check which version of the benchmark was used.
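
Show note: a rough sketch of the "overlapping error bars" check, using a normal-approximation (Wald) 95% interval on each model's accuracy. The problem count and the two scores are hypothetical stand-ins for "around 87%" and a half-point gap, and a published leaderboard may compute its error bars differently.

```python
import math


def wald_interval(accuracy: float, n_problems: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for an accuracy measured on n_problems."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n_problems)
    return accuracy - z * standard_error, accuracy + z * standard_error


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


N_PROBLEMS = 300  # hypothetical benchmark size

model_a = wald_interval(0.870, N_PROBLEMS)  # hypothetical score
model_b = wald_interval(0.875, N_PROBLEMS)  # hypothetical score, half a point higher

# If the intervals overlap, treat the two models as tied on this benchmark.
is_tie = intervals_overlap(model_a, model_b)
```

At this sample size each interval spans several points on either side of the score, so a half-point lead falls well inside the noise.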

[Jamie]: That’s a good point. If you’re picking an AI model for your company, you want to know the test wasn’t secretly broken.

[Alex]: And here’s the thing—none of this makes FrontierMath a bad benchmark. In fact, Epoch’s transparency is textbook. They published every correction, versioned the dataset, and kept the old version available for comparison. That’s the gold standard for AI benchmarks.

[Jamie]: So, the takeaway is: trust, but verify. And double-check the answer key.

[Alex]: And if you’re ever benchmarking your own models, always look for published changelogs and version numbers. If the numbers quietly change in the night—run.

[Jamie]: Or at least send in your own squad of AI auditors.

[Alex]: [laughs] Exactly.

[Jamie]: Alright, that’s a wrap for today’s episode! If you enjoyed us geeking out over math errors and AI benchmarks, be sure to subscribe and leave us a review.

[Alex]: And remember—next time someone claims their model is “state-of-the-art,” ask them which version of the test they used. Or just send them this episode.

[Jamie]: We’ll be back soon with more nerd-level deep dives. Thanks for listening to the Nerd Level Tech AI Cast!

[Alex]: See you next time! [Outro music fades out]