Were the AI models wrong, or was the benchmark wrong?

The benchmark's answer key was wrong on the affected problems. Most errors were calculation slips made when authors recorded the final answer, so correct model responses were graded as incorrect.²

How were the errors discovered?

An audit that began in April 2026, prompted by an OpenAI internal review, used GPT-5.5 and Claude Opus 4.7 to flag suspect problems, which human mathematicians then verified.²

Did the rankings change after the fix?

Scores rose a few points across the board, but model rankings stayed broadly the same. The top models on Tiers 1-3 are now within overlapping error margins of each other.²⁴

How hard was FrontierMath originally?

At its November 2024 launch, leading models solved less than 2% of problems, far below their 90%+ scores on older math tests like GSM-8K and MATH.³

Is FrontierMath still trustworthy?

Yes, arguably more so. Epoch published the corrections in detail, versioned the dataset, and kept v1 available: the kind of transparency that makes a benchmark more reliable, not less.¹

FrontierMath v2: 42% of Math Problems Had Errors

June 26, 2026

On June 12, 2026, Epoch AI shipped FrontierMath v2 — an update that addressed errors in 42% of the benchmark's problems, correcting 135 and removing 12. FrontierMath is one of the hardest math tests in AI, and since its 2024 debut its scores have anchored "state-of-the-art" reasoning claims. v2 is a reminder that even a flagship benchmark can be wrong about a large share of its own questions.

TL;DR

What happened: Epoch AI released FrontierMath v2 on June 12, 2026, after an audit found errors affecting 42% of problems.¹
The numbers: v2 corrected 123 problems in Tiers 1-3 and 12 in Tier 4, and removed 5 from Tiers 1-3 and 7 from Tier 4. The dataset now holds 338 problems — a 295-problem base set (Tiers 1-3) plus a 43-problem Tier 4 expansion.¹
What broke: Most errors were simple calculation slips made while the author extracted the final answer (off-by-one mistakes, flipped signs). A handful of problem statements were fatally ambiguous.²
How it was caught: The audit started in April 2026 after OpenAI flagged more errors than expected in an internal review. Epoch used GPT-5.5 and Claude Opus 4.7 to surface suspect problems, then had mathematicians confirm them.²
The impact: Scores rose a few points across the board, but model rankings stayed broadly the same.²
Why it matters: Every "best on FrontierMath" claim made before June 12 was scored against a test wrong about two in five of its own questions.

What You'll Learn

What FrontierMath is and why it became a reference benchmark
Exactly what changed in the v2 update
The kinds of errors that slipped through expert review
How the audit found them — and the role AI models played
How the leaderboard moved after the fix
What the episode means for anyone who picks models based on benchmark numbers

What is FrontierMath?

FrontierMath is a benchmark of hundreds of original, expert-crafted mathematics problems built by Epoch AI to measure advanced mathematical reasoning. It launched on November 8, 2024, developed with more than 60 mathematicians, including professors, International Mathematical Olympiad question writers, and Fields medalists.³ The problems span computational number theory, real analysis, algebraic geometry, and category theory, and a typical one takes a specialist researcher hours — sometimes days — to solve.³

Two design choices made it credible. First, the problems are novel and unpublished, so a model cannot have memorized the answer from training data. Second, each answer is a large number or a complex mathematical object that is checked automatically by running a Python answer() function, with a deliberately "guessproof" design: less than a 1% chance of guessing correctly without doing the math.³
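
The automatic check can be pictured with a minimal Python sketch. It assumes a submission is Python source that defines answer() and that the key stores one exact value per problem. The helper name grade_submission and the sample key value are hypothetical, and Epoch's actual harness is not described here; a real one would also sandbox and time-limit execution, which this sketch does not.

```python
from fractions import Fraction


def grade_submission(source: str, stored_answer: object) -> bool:
    """Run a submission's answer() and compare it to the stored key exactly.

    Hypothetical helper: a real harness would sandbox and time-limit this.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)          # defines answer() in `namespace`
        result = namespace["answer"]()   # call the submitted function
    except Exception:
        return False                     # crashes or a missing answer() fail
    # Exact match on both type and value: no tolerance, no partial credit.
    return type(result) is type(stored_answer) and result == stored_answer


# Illustrative key value only; not a real FrontierMath answer.
KEY = Fraction(367707, 2**20)

correct = (
    "from fractions import Fraction\n"
    "def answer():\n"
    "    return Fraction(367707, 1048576)\n"
)
near_miss = (
    "from fractions import Fraction\n"
    "def answer():\n"
    "    return Fraction(367708, 1048576)\n"
)

print(grade_submission(correct, KEY))    # True
print(grade_submission(near_miss, KEY))  # False
```

Because the comparison is all-or-nothing, a near miss earns no credit, which is what makes large exact answers hard to guess.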

When it launched, FrontierMath was brutal. Leading models of the day — Claude 3.5 Sonnet, OpenAI's o1-preview, GPT-4o, and Gemini 1.5 Pro — each solved less than 2% of problems, against 90%+ scores on older tests like GSM-8K and MATH.³ That gap is exactly what made the benchmark useful: it had headroom no other math test offered.

What changed in FrontierMath v2

The v2 update, published on Epoch's benchmark hub on June 12, 2026, reworked a large share of the dataset.¹ Here is the precise accounting from Epoch's own changelog:

Change	Tiers 1-3	Tier 4	Total
Problems corrected	123	12	135
Problems removed	5	7	12

After the cleanup, the full dataset is 338 problems: a 295-problem base set Epoch calls Tiers 1-3, plus a 43-problem expansion of exceptionally hard problems called Tier 4.¹ Twelve problems are public (ten from Tiers 1-3 and two from Tier 4); the rest stay private to limit contamination.¹ Counting corrections and removals together, the update touched 147 of the 350 problems in the prior version (v1) — the 42% figure Epoch cites.¹
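
The accounting above can be checked in a few lines of Python. The counts are copied from the changelog table. The per-tier v1 sizes are derived here (v2 size plus removals) and are not figures quoted from Epoch; the sketch also assumes corrected problems stayed in the dataset.

```python
# Counts copied from the changelog table above.
corrected = {"tiers_1_3": 123, "tier_4": 12}
removed = {"tiers_1_3": 5, "tier_4": 7}
v2_size = {"tiers_1_3": 295, "tier_4": 43}

# Derived, not quoted: v1 size = what remains in v2 plus what was removed.
v1_size = {tier: v2_size[tier] + removed[tier] for tier in v2_size}

touched = sum(corrected.values()) + sum(removed.values())
v1_total = sum(v1_size.values())
v2_total = sum(v2_size.values())

print(v1_size)                      # {'tiers_1_3': 300, 'tier_4': 50}
print(touched, v1_total, v2_total)  # 147 350 338
print(f"{touched / v1_total:.0%}")  # 42%
```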

This is not the first time Epoch acknowledged the risk. At launch in 2024, a second review of a random subsample estimated that about 1 in 20 problems (~5%) contained errors, which Epoch noted was comparable to error rates in benchmarks like ImageNet.³ That the v2 audit ended up touching 42% shows how far a small, plausible-sounding error estimate can be from the result of a deep, dedicated review.

What kind of errors were found

The errors were not exotic. According to Epoch's writeup, the vast majority were simple calculation mistakes that crept in when the problem author was extracting the final answer — the off-by-one and flipped-sign class of slip that any working mathematician recognizes.² Because FrontierMath grades on an exact match between the model's submitted object and a single stored answer, one wrong digit in that stored answer silently marks every correct model as wrong.

A smaller set of problems had statements that were fatally ambiguous — phrased in a way that admitted more than one defensible answer, which makes automated grading meaningless.² Those were the problems most likely to be removed rather than corrected.

The takeaway is subtle but important: the math behind the problems was sound. What failed was the bookkeeping around the answer key, and that is precisely the layer that automated, exact-match grading is least able to catch on its own.
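
A toy Python example makes the effect of a mis-keyed answer concrete. Every value below is invented for illustration: the problem ids, the answers, and the two model names are assumptions, and the score function is a simplified stand-in for exact-match grading rather than Epoch's scorer.

```python
def score(submissions: dict[str, dict[str, int]], key: dict[str, int]) -> dict[str, float]:
    """Fraction of problems each model matches exactly against the key."""
    return {
        model: sum(answers.get(pid) == key[pid] for pid in key) / len(key)
        for model, answers in submissions.items()
    }


# Toy data: every value here is invented for illustration.
true_answers = {"p1": 120, "p2": -7, "p3": 4096, "p4": 31}

# Key as recorded: p2 has a flipped sign, p4 is off by one.
buggy_key = {"p1": 120, "p2": 7, "p3": 4096, "p4": 30}

submissions = {
    "model_a": {"p1": 120, "p2": -7, "p3": 4096, "p4": 31},  # all correct
    "model_b": {"p1": 120, "p2": -7, "p3": 0, "p4": 31},     # misses p3
}

print(score(submissions, buggy_key))     # {'model_a': 0.5, 'model_b': 0.25}
print(score(submissions, true_answers))  # {'model_a': 1.0, 'model_b': 0.75}
```

In this toy case the order of the two models is the same under both keys, but both scores are understated until the key is fixed, and nothing in the grader's output signals that the key is at fault.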

How the errors were caught

The audit began in April 2026, after OpenAI told Epoch it had found more errors than expected during an internal review of the benchmark.² OpenAI funded FrontierMath's creation and has exclusive access to a subset of it, so it has a close working view of the questions.¹

Epoch's process leaned on the very models the benchmark is meant to grade. The team ran GPT-5.5 and Claude Opus 4.7 over the dataset to flag problems that looked suspect, then handed those flags to human mathematicians for review.² Almost all of the flagged items turned out to be genuine, severe errors — the kind that made a problem impossible to solve as written.² In other words, frontier models were good enough to help find the holes in a test designed to expose their limits.
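
Epoch's flagging criteria are not spelled out here, so the following Python sketch is only one plausible heuristic, not the audit's actual method: flag a problem when independent models agree with each other but disagree with the stored key, then pass the flags to human reviewers. The function name flag_suspects, the min_agree threshold, and all data are hypothetical.

```python
from collections import Counter


def flag_suspects(key, model_answers, min_agree=2):
    """Return problem ids where at least min_agree models share an answer
    that differs from the stored key. Flags are candidates for human review."""
    suspects = []
    for pid, stored in key.items():
        votes = Counter(
            answers[pid] for answers in model_answers.values() if pid in answers
        )
        if not votes:
            continue  # no model attempted this problem
        consensus, count = votes.most_common(1)[0]
        if count >= min_agree and consensus != stored:
            suspects.append(pid)
    return suspects


# Invented toy data: the stored key for p2 has a flipped sign.
key = {"p1": 120, "p2": 7, "p3": 4096}
model_answers = {
    "model_a": {"p1": 120, "p2": -7, "p3": 4096},
    "model_b": {"p1": 120, "p2": -7, "p3": 0},
}

print(flag_suspects(key, model_answers))  # ['p2']
```

A heuristic like this can only nominate suspects. Models can agree on the same wrong answer, which is why the human verification step matters.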

How the leaderboard moved

Cleaning the answer key lifted scores, because corrected problems that models had actually solved now count in their favor. Epoch reports that scores rose across the board while rankings stayed broadly intact.²

The current top of the table is close. According to the LM Council aggregator — a secondary tracker, not Epoch's official ranking — GPT-5.5 Pro and Claude Fable 5 both sit around 87% on Tiers 1-3, a gap well inside their overlapping error margins.⁴ Treat any single figure as approximate: even within that same LM Council table, the non-Pro GPT-5.5 (xhigh) variant scores closer to 85%, a reminder that model variant matters as much as model name. The robust, source-backed story is the trajectory — from under 2% at launch in 2024 to the high 80s on Tiers 1-3 in 2026.³⁴

That arc is genuinely striking. It is also exactly why the accuracy of the underlying answer key matters so much: when models cluster within a point or two of each other near the top, a handful of mis-keyed answers can reorder the leaderboard.
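
To see why a two-point gap near the top can sit inside the error margins, here is a rough Python sketch using a normal-approximation (Wald) interval. It assumes each problem is an independent pass/fail trial and that scores are computed over all 295 Tiers 1-3 problems; both are simplifying assumptions, and the leaderboards' own margin calculations may differ.

```python
import math


def wald_interval(accuracy: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for a pass rate measured on n problems."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width


def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


N = 295  # assumption: scored on the full Tiers 1-3 base set
high = wald_interval(0.87, N)
low = wald_interval(0.85, N)

print(f"87%: {high[0]:.3f} to {high[1]:.3f}")
print(f"85%: {low[0]:.3f} to {low[1]:.3f}")
print(overlaps(high, low))  # True
```

Under these assumptions each half-width comes out near four percentage points, so scores of 85% and 87% cannot be separated on this evidence alone.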

What this means for trusting AI benchmarks

Engineering teams pick models, justify budgets, and commit to architectures partly on benchmark numbers. FrontierMath v2 is a clean case study in why those numbers deserve a second look.

Three lessons stand out. First, a published error rate is a floor, not a ceiling. Epoch's careful launch-time estimate of ~5% sat far below the 42% the v2 audit ended up touching — not because the team was careless, but because a light subsample and a deep audit are different instruments.²³ Second, exact-match grading hides answer-key bugs. When a single stored value decides right or wrong, a typo in that value penalizes every correct model invisibly. Third, margins matter more than rank. A model that "leads" by half a point on a test with known answer-key errors has not really been shown to lead at all.

None of this makes FrontierMath a bad benchmark — the opposite. Epoch found the errors, published the corrections problem-by-problem, versioned the dataset, and kept v1 available for comparison.¹ That transparency is the model other benchmark maintainers should copy. If you cite benchmark scores in a decision, the practical move is to note the benchmark version and date, treat sub-margin gaps as ties, and prefer benchmarks that publish a changelog over those that quietly overwrite their numbers.
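
One way to put that advice into practice is to store each cited score with its benchmark version and date, and to compare scores only through a function that knows about margins. The Python sketch below is an illustration under stated assumptions: the CitedScore type, the compare function, the tie rule (gap no larger than the wider margin), the model names, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CitedScore:
    model: str
    benchmark: str
    version: str     # e.g. "v2"
    as_of: date      # date the score was read
    accuracy: float  # 0.0 to 1.0
    margin: float    # reported +/- margin, same units as accuracy


def compare(a: CitedScore, b: CitedScore) -> str:
    """Name the leader, or report a tie when the gap is inside the margins."""
    if (a.benchmark, a.version) != (b.benchmark, b.version):
        raise ValueError("scores come from different benchmark versions")
    gap = abs(a.accuracy - b.accuracy)
    if gap <= max(a.margin, b.margin):
        return "tie"
    return a.model if a.accuracy > b.accuracy else b.model


# Invented numbers and placeholder model names, for illustration only.
bench = "FrontierMath Tiers 1-3"
x = CitedScore("model_x", bench, "v2", date(2026, 6, 26), 0.870, 0.04)
y = CitedScore("model_y", bench, "v2", date(2026, 6, 26), 0.865, 0.04)
z = CitedScore("model_z", bench, "v2", date(2026, 6, 26), 0.700, 0.05)

print(compare(x, y))  # tie
print(compare(x, z))  # model_x
```

Refusing to compare across versions is deliberate: a v1 score and a v2 score were graded against different answer keys.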

For more on how benchmark integrity can break down in subtler ways, see our analysis of the DeepSWE benchmark and contamination concerns. For context on the models topping FrontierMath today, see our coverage of Claude Opus 4.8's launch and Claude Fable 5.

The Bottom Line

FrontierMath v2 did not break the benchmark — it repaired it, in the open. But the size of the fix is the story: a flagship test built with Fields medalists and automated verification still needed errors addressed in 42% of its problems.¹ The next time a model launch leans on a single benchmark number, ask which version it was scored against, how wide the error bars are, and whether the maintainer publishes a changelog. On FrontierMath, those questions now have clear answers.

Sources

1. Epoch AI, "FrontierMath Tiers 1-3 (v2)": benchmark hub, changelog, and dataset composition. https://epoch.ai/benchmarks/frontiermath-tiers-1-3-v2
2. Analysis of the FrontierMath v2 error correction, error types, and audit process. https://www.digitalapplied.com/blog/epoch-frontiermath-v2-error-corrected-ai-benchmark-analysis
3. Epoch AI, "FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI" (Nov. 8, 2024). https://epoch.ai/frontiermath/the-benchmark
4. LM Council benchmark aggregator, FrontierMath leaderboard (secondary source). https://lmcouncil.ai/benchmarks

Frequently Asked Questions

What is FrontierMath v2?

FrontierMath v2 is the corrected version of Epoch AI's FrontierMath math-reasoning benchmark, released June 12, 2026. It addressed errors in 42% of problems, correcting 135 and removing 12, leaving 338 problems in total.¹