The Gap Between Vendor Scores and Standardized Ones

Three numbers all claim to be the best current SWE-bench Pro score. As of early June: 80.3% for Claude Fable 5 on Anthropic's own scaffold, 59.1% for GPT-5.4 on Scale's standardized SEAL leaderboard, and 47.1% for Claude Opus 4.6 on Scale's private commercial set. All three are real. The spread comes from scaffolding and data splits, and most pages quoting a score never say which one they mean.

That variance is the actual story. Not whether model A beats model B, but whether the number you are reading reflects a controlled, reproducible setup or a vendor's best-case harness on a public dataset the model may have seen before.

The contamination problem is not hypothetical

In February 2026, OpenAI published an analysis explaining why SWE-bench Verified no longer measures frontier coding progress. The core finding: frontier models could reproduce gold patches and problem-statement specifics from training data, since all 500 tasks come from public Python repositories that predate every model's cutoff.

SWE-bench Pro was introduced partly to fix this. It splits tasks across three tiers: public repositories, held-out private repositories reserved to prevent overfitting, and commercial codebases sourced from real startups. The commercial tier is never released publicly; only evaluation results are shared. This makes memorization far less effective and improves confidence that scores reflect reasoning, not recall.

The practical tell: on the harder SWE-bench Pro, all models drop roughly 20 points compared to their Verified scores - showing how much of that "Verified" performance was benchmark-specific. A model with a 20-point gap between its Verified and Pro scores is not worse than you thought. It was always that good. The benchmark was just easier than it looked.

The score on the press release is usually the vendor's best harness on the easiest data split.

The judge problem compounds it

There is a second layer. When outputs are graded by LLM judges rather than deterministic test runners, a judge is likely to favour outputs from its own model family. One team ran the same benchmark with three different judges and found that one model swung 47 percentage points on a single skill depending on who graded it.

So two things are stacked: contaminated inputs on one end, biased grading on the other. The number in the middle is doing a lot of quiet work to look clean.
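A quick way to see judge sensitivity in your own evals is to grade the same outputs with several judges and measure how far each model's score moves. The following is a minimal sketch of that check. The judge names, model names, and scores are hypothetical placeholders, not figures from any published run.

```python
def judge_swing(scores_by_judge):
    """Return, per model, the gap between its highest and lowest score
    across judges (in percentage points)."""
    models = set()
    for per_model in scores_by_judge.values():
        models.update(per_model)

    swing = {}
    for model in sorted(models):
        # Collect this model's score from every judge that graded it.
        values = [
            per_model[model]
            for per_model in scores_by_judge.values()
            if model in per_model
        ]
        swing[model] = max(values) - min(values)
    return swing


# Hypothetical scores: same outputs, three different judges.
scores_by_judge = {
    "judge_x": {"model_1": 70, "model_2": 65},
    "judge_y": {"model_1": 40, "model_2": 62},
    "judge_z": {"model_1": 55, "model_2": 66},
}

for model, gap in judge_swing(scores_by_judge).items():
    print(f"{model}: swing of {gap} points across judges")
```

A large swing for one model and a small one for another suggests the grade depends on the judge as much as on the output, which is a reason to prefer deterministic test runners where they exist.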

The steelman for leaderboards

None of this means benchmarks are useless. Public leaderboards record how AI capabilities evolve over time, and that longitudinal signal is genuinely valuable - not for picking a model today, but for understanding whether the category is improving. The trajectory is clear: benchmarks are becoming more difficult, more realistic, and harder to game. SWE-bench Pro, DeepSWE, and SWE Atlas represent real methodological progress. If you ignore leaderboards entirely, you lose a rough but honest signal about the direction of the field.

The problem is not the benchmarks. It is using a single headline number as the basis for a purchasing decision.

What to read instead

Practical reading order for a model decision: commercial-set score first (closest to private-codebase reality), public SEAL score second (clean cross-model comparison), vendor numbers last.

A model that scores 80% on Verified but 50% on Pro is probably benchmark-overfit: it keeps only about 63% of its Verified score. A model that scores 85% on Verified and 60% on Pro keeps about 71% and is more trustworthy. The ratio of Pro to Verified matters more than either number in isolation.
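The ratio comparison is simple arithmetic, and it can be written down as a small helper. This sketch uses the two illustrative models from the paragraph above. The function name `retention` and the labels `model_a` and `model_b` are made up for the example.

```python
def retention(verified: float, pro: float) -> float:
    """Fraction of the Verified score that survives on Pro."""
    if verified <= 0:
        raise ValueError("verified score must be positive")
    return pro / verified


# (Verified %, Pro %) for the two illustrative models in the text.
models = {
    "model_a": (80.0, 50.0),
    "model_b": (85.0, 60.0),
}

for name, (verified, pro) in models.items():
    print(f"{name}: retention = {retention(verified, pro):.3f}")
# model_a: retention = 0.625
# model_b: retention = 0.706
```

A higher retention means less of the Verified score depended on that benchmark specifically. It is a rough heuristic, not a formal contamination test.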

A model ranked first on a leaderboard may be many times more expensive per token than the model at fourth place, and for most production workloads the price-performance frontier matters more than the raw capability ranking. Read benchmark results alongside a cost-per-token comparison and a reasoning-effort versus quality breakdown, since extended-thinking modes inflate scores while inflating cost and latency in lockstep.
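One way to read scores and prices together is to keep only the models that no other model beats on both cost and score, the price-performance frontier mentioned above. The following is a sketch under stated assumptions: the model names, scores, and per-million-token prices are invented for illustration, and a real comparison would also need to account for tokens consumed per task and latency.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    score: float          # benchmark score, higher is better
    cost_per_mtok: float  # price per million tokens, lower is better


def pareto_frontier(candidates):
    """Keep candidates that score strictly higher than every cheaper
    (or equally priced) candidate."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on a price tie, the higher
    # score is considered first.
    ordered = sorted(candidates, key=lambda c: (c.cost_per_mtok, -c.score))
    for candidate in ordered:
        if candidate.score > best_score:
            frontier.append(candidate)
            best_score = candidate.score
    return frontier


# Hypothetical models and prices, for illustration only.
candidates = [
    Candidate("model_a", score=60.0, cost_per_mtok=15.0),
    Candidate("model_b", score=58.0, cost_per_mtok=3.0),
    Candidate("model_c", score=55.0, cost_per_mtok=5.0),
    Candidate("model_d", score=50.0, cost_per_mtok=1.0),
]

for c in pareto_frontier(candidates):
    print(f"{c.name}: score {c.score}, ${c.cost_per_mtok}/Mtok")
```

In this made-up data, `model_c` drops out because `model_b` is both cheaper and higher scoring. The top-ranked `model_a` stays on the frontier, but whether two extra points justify five times the price is a workload decision, not a leaderboard one.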

There is also a harder question underneath all of this. METR published an analysis finding that many SWE-bench-passing pull requests would not survive human code review. Their time-horizon analysis showed that models perform well on short tasks but degrade sharply on longer ones - suggesting the benchmark overweights quick fixes and underweights the architectural reasoning that defines senior engineering work.

The coding benchmarks that teams cite most are still largely benchmarks of patch-writing, not of sustained multi-session reasoning across a real codebase. An agent that resolves 60% of isolated GitHub issues may still struggle to hold the shape of a system across a week of changes. A teammate like Beagle lives in that longer time horizon - not in individual issue resolution, but in the connective tissue between tasks - and no current public benchmark measures that well.

The honest thing to say is that the number the vendor put in the subject line of the announcement email tells you something, just not what most teams think it tells them. Check the split. Check who held the stopwatch. Then run it on something that actually looks like your code.
