Read the FrontierCode Score Before You Ship Agent-Written Code

The bar for passing a coding benchmark is low. Write a patch. Make the tests go green. Move on. Nobody asks whether the patch is idiomatic, whether it introduces scope creep, or whether the team lead would reject it on Friday afternoon.

Cognition's FrontierCode, released on June 8, finally asks those questions at scale.

What the benchmark measures

Most coding benchmarks, SWE-Bench being the famous one, test whether an agent can complete an isolated task and produce functionally correct output. The agent gets a repo and an issue, writes a patch, and we check whether the tests pass. Useful, but incomplete.

FrontierCode grades end-to-end code quality: correctness, test quality, scope discipline, style, and adherence to codebase standards. It doesn't care whether CI is green if the implementation is unidiomatic garbage. The benchmark has 150 tasks drawn from 36 open-source repositories, with a harder 50-task subset called Diamond. Each task took 40+ hours of work from actual open-source maintainers, and grading covers behavioral correctness, regression safety, scope creep, and code quality - not just whether tests pass.
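To make the contrast concrete, here is a minimal Python sketch of test-only grading versus grading across several dimensions. Everything in it is hypothetical: the `Review` fields, the `quality_bar` threshold, and the rule that every dimension must pass are illustrative assumptions, not Cognition's actual scoring method.

```python
from dataclasses import dataclass


@dataclass
class Review:
    """Hypothetical per-patch grades; field names are illustrative only."""
    tests_pass: bool      # behavioral correctness, as checked by unit tests
    no_regressions: bool  # regression safety
    in_scope: bool        # no unrelated changes (scope discipline)
    quality: float        # assumed rubric score in [0, 1] for style and conventions


def ci_style_grade(review: Review) -> bool:
    """Test-only grading: green tests are enough."""
    return review.tests_pass


def tech_lead_grade(review: Review, quality_bar: float = 0.7) -> bool:
    """Assumed rule: every dimension must clear its bar to count as a pass."""
    return (
        review.tests_pass
        and review.no_regressions
        and review.in_scope
        and review.quality >= quality_bar
    )


# A patch that turns the tests green but also touches unrelated files.
patch = Review(tests_pass=True, no_regressions=True, in_scope=False, quality=0.8)
print(ci_style_grade(patch))   # True
print(tech_lead_grade(patch))  # False
```

The same patch passes one grader and fails the other, which is the kind of disagreement a quality-focused benchmark is built to surface.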

While other benchmarks generated issues from single PRs via programmatic scraping, FrontierCode's tasks are hand-selected by repo maintainers from multi-PR chains and freeform requests. That distinction matters. It means the tasks look like real feature requests and refactoring asks - the messy, under-specified work that actually lands in backlogs - not tidy self-contained bugs.

The grading problem they solved

Getting rubric-based grading to be reliable is hard. Asking a model to judge another model's code introduces all kinds of drift. FrontierCode employs a novel ensemble of grading techniques, including traditional unit tests, rubrics for subjective quality assessment, and new types of verifiers designed to catch subtle errors and stylistic issues. Cognition implemented an extensive quality control pipeline, including adversarial testing and multi-stage manual reviews by researchers.

The result: FrontierCode produces 81% fewer misclassification errors than SWE-Bench Pro.

That last number is the one to sit with. It implies that a large fraction of passing grades on the dominant existing benchmark correspond to patches that wouldn't survive a real code review. METR experiments found that high-scoring models on these benchmarks often produce patches that wouldn't be accepted by human maintainers.
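It helps to be clear about what "81% fewer" means arithmetically: it is a relative reduction, so the absolute effect depends on a baseline the article does not give. The sketch below uses a made-up baseline error rate purely for illustration; `reduced_error_rate` and the 10% figure are assumptions, not reported numbers.

```python
def reduced_error_rate(baseline_rate: float, relative_reduction: float) -> float:
    """Error rate left after cutting `baseline_rate` by `relative_reduction`.

    Both arguments are fractions in [0, 1].
    """
    if not 0.0 <= baseline_rate <= 1.0:
        raise ValueError("baseline_rate must be between 0 and 1")
    if not 0.0 <= relative_reduction <= 1.0:
        raise ValueError("relative_reduction must be between 0 and 1")
    return baseline_rate * (1.0 - relative_reduction)


# Hypothetical baseline: suppose the older benchmark misgrades 10% of patches.
# The 10% figure is an assumption for illustration only.
baseline = 0.10
after = reduced_error_rate(baseline, 0.81)
print(f"{after:.3f}")  # 0.019
```

Under that assumed baseline, roughly 2 in 100 patches would still be misgraded. A different baseline scales the result proportionally.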

Where others grade like a CI, FrontierCode grades like a tech lead.

The actual scores

SOTA LLMs have significant room for improvement, with the top model earning a score of just 13.4/100 on the Diamond task set.

On FrontierCode Main and Extended, Opus 4.8 still maintains a clear lead, at 34.3% and 51.8%, respectively. So on easier tasks, the models are doing something real. On the hardest work - the kind of multi-file, cross-cutting changes that require knowing what the codebase actually values - the frontier stalls out around 13%.

There's also a large gap between open-source models and the frontier. Kimi K2.6, the best-performing open-source model, achieves just 3.8% on Diamond, 16% on Main, and 37% on Extended.
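The size of that gap depends on how you measure it. The short Python sketch below takes the scores quoted above and computes both the absolute gap in points and the ratio for each task set. The dictionary and function names are illustrative, and "frontier best" simply means the top score quoted for each set, which may not come from the same model on every set.

```python
# Scores as quoted in this article, in points out of 100.
frontier_best = {"Diamond": 13.4, "Main": 34.3, "Extended": 51.8}
open_source_best = {"Diamond": 3.8, "Main": 16.0, "Extended": 37.0}


def gap_table(top: dict, other: dict) -> dict:
    """Return {task_set: (absolute gap in points, ratio of top to other)}."""
    return {
        name: (round(top[name] - other[name], 1), round(top[name] / other[name], 2))
        for name in top
    }


gaps = gap_table(frontier_best, open_source_best)
for name, (points, ratio) in gaps.items():
    print(f"{name}: {points} points, {ratio}x")
```

Measured in points, the gap is widest on Main. Measured as a ratio, it is widest on Diamond, where the top score is about three and a half times the open-source one.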

What's genuinely new here

The question FrontierCode asks is not new. Developers have complained for years that AI-generated code passes tests while failing reviews. What's new is someone building a rigorous eval instrument around it.

There's a useful framing: three eras of AI coding benchmarks - 2021, autocomplete and HumanEval; 2023, passing tests via SWE-Bench; 2026, maintainable code via FrontierCode. Whether that framing holds depends on how widely the benchmark gets adopted, but the underlying logic is sound. The eval is measuring something previous evals were ignoring.

One detail worth flagging: running FrontierCode against older models shows that the easiest third of tasks were rapidly solved over late 2025. Claude Opus nearly doubled its pass rate on those tasks - from 41% to 74% in about four months. That matches the general sense among developers that something shifted around December 2025, when agentic coding loops started feeling feasible rather than fragile. The harder tiers haven't moved much yet.

That's a useful signal for anyone planning agent-assisted engineering work. The straightforward stuff - isolated bugs, well-scoped refactors - is increasingly tractable. The nuanced, judgment-heavy work is still mostly human territory.

The legitimate caveats

This benchmark has a conflict of interest baked in. Cognition makes Devin, an autonomous coding agent. A benchmark that makes all current models look inadequate serves their commercial narrative. That doesn't make the methodology wrong, but it's worth keeping in mind when reading their analysis.

The tasks aren't public, which limits independent verification. Scores reflect a model-plus-harness combination, not the model alone. And because grading relies partly on rubric-based prompts rather than purely automated checks, some subjectivity remains.

Cognition currently has no plans to release the tasks publicly, to avoid contamination - but they are opening up evaluation access to all model creators. That's a reasonable tradeoff for longevity, but it means external researchers can't easily reproduce the numbers. Developers in the community have already raised questions about variance and reproducibility across model temperature settings, and the Cognition team has responded, though the data shared so far is thin.

The benchmark matters because it names a failure mode that has been hiding in plain sight. A teammate like Beagle can surface the right context for an engineer reviewing agent output - the prior decisions, the style guidelines, the constraints - but the underlying model still has to produce code worth merging. Right now, even the best models clear that bar only about half the time on the Extended set, about a third of the time on Main, and rarely on Diamond.

That's not a reason to stop using coding agents. It is a reason to stop reading SWE-Bench scores as though they answer the question you actually care about.
