Open-Weight vs Closed Frontier Models: What the Bill Actually Shows

An engineer on a five-person product team opened the Slack thread at 9 a.m. on a Tuesday: "Our Claude API bill was $4,200 last month and we haven't shipped the feature yet." Two hours later, they'd swapped their agentic coding pipeline to DeepSeek V4 Flash. The bill estimate for the same workload: roughly $90.

That's not a hypothetical. V4 Flash on a single A100 is documented to saturate a five-developer coding team , and DeepSeek made its 75% discount permanent in May 2026, landing V4 Pro at roughly 34x cheaper on input and 86x cheaper on output than Claude Opus 4.7 . The performance story has changed fast enough that the question "open-weight or closed?" now deserves a real answer, not a reflex.

What the benchmark gap actually looks like right now

The short version: narrower than most teams' mental model, but not gone.

The open-source model field closed the coding benchmark gap with frontier closed-source models to within 2-3 percentage points in the first four months of 2026, while maintaining a 6-7x price advantage on output tokens.

DeepSeek V4 Pro set the open-weight ceiling with a score of 80.6% on SWE-bench Verified, matching GPT-5.5-class agentic performance.

GLM 5.2, released in mid-June, is breaking through on planning quality and long-horizon coding specifically

it's a 744-billion-parameter MoE model that became the first open-weight model to top the Artificial Analysis Intelligence Index, placing fifth overall against the closed frontier.

The gap that remains is real and worth naming honestly. On the Artificial Analysis Intelligence Index, GPT-5.5 at 60 and Claude Opus 4.7 at 57 are meaningfully ahead of any open-weight model. On SWE-Bench Pro, Opus 4.7's 64.3% lead over Kimi K2.6's 58.6% reflects genuine capability on the hardest real-world coding tasks.

Closed models - Claude Opus 4.6, GPT-5.4 Pro, and Gemini 3.1 Pro - retain a meaningful lead on reasoning-heavy benchmarks like GPQA Diamond and Humanity's Last Exam, typically by 3-8 percentage points.

That spread sounds small until you map it to actual work. For agents that handle highly variable inputs - customer support, research synthesis, creative problem-solving - frontier closed-source models still have an edge in handling unexpected situations gracefully. The gap shows up most clearly when agents encounter edge cases outside their training distribution.

Where the cost math actually flips

The conversation about open-weight costs usually skips the most important variable: volume.

At low and moderate volume, the closed API is cheaper, and it is not close. Someone else bought the GPUs, amortizes idle time across thousands of customers, and charges per token. Self-hosting means renting or buying GPUs that cost the same whether saturated or idle, plus the engineering to run them.

At DeepSeek's API rates ($0.14/M input, $0.28/M output), break-even with reserved instances only arrives around 3-4 billion tokens per day. Self-host for sovereignty or fine-tuning reasons rather than raw cost.

This is a point most posts about "open-weight is cheaper" quietly skip. If your team processes ten million tokens a day, you probably don't self-host DeepSeek V4 Pro to save money. You do it because you need the weights on your infrastructure, or because the MIT license lets you fine-tune the router. The API rate is already cheap enough that the managed route wins on economics for most teams until volume gets very high and very steady.

The architecture of the new open-weight models makes the managed route more attractive than it used to be. A cache hit on V4 Pro costs $0.003625 per million tokens - 120x less than its cache-miss rate. Agentic coding loops that resend the same system prompt and file context on every turn see most of their input tokens land in cache, so effective per-session cost runs far below the list rate.

The decision is per-workload, not per-company

F5's 2026 research found organizations running or evaluating an average of seven models, with 78% operating some inference themselves. That's not confusion. That's sensible portfolio management.

The practical decision tree has three branches, and the branches are independent per workload:

Does the task sit at the hard frontier? On many common enterprise tasks - coding, text classification, summarization, structured data extraction, instruction following - the best open-weight models now perform comparably to GPT-4o and Claude Sonnet. On the most complex reasoning tasks and in long agentic workflows, closed frontier models still hold an edge. If the task is classification or extraction, the hard frontier doesn't matter. If it's a novel multi-agent research workflow, it might.

Is the workload regulated? HIPAA, PCI, SOC 2 Type II, and EU AI Act alignment all work better with closed-source vendors that publish formal compliance documentation. Open-weight deployments inherit compliance responsibility directly. For regulated clients, this alone can make closed source the right default regardless of cost. The EU AI Act becomes fully applicable on 2 August 2026, which is making this question more urgent for European teams.

Do you need sovereignty or version stability? This is where open weights genuinely win on something other than price. This represents a move away from "renting" intelligence through public APIs toward owning the full stack. When models are released with open weights, enterprises gain the ability to budget and deploy them within air-gapped environments. A fine-tune you built last quarter stays put; an API provider's model does not.

One caution worth adding: the license is not the same thing as "open-source." Read every weight license before procurement sign-off. Apache 2.0 is the exception, not the rule, at the open-weight frontier. DeepSeek V4 is MIT, which is genuinely permissive. Several other models in the top tier are not.

What mid-June's model landscape actually looks like

The open-weight field moved fast enough in Q2 2026 that anyone making a decision from a January 2026 mental model is making it on stale data. There were roughly twelve substantive frontier releases in Q1 2026 alone - double Q4 2025.

DeepSeek V4 Flash is the first open-weight model that teams dropped into real agentic pipelines as a plausible substitute for an Anthropic- or OpenAI-class frontier model. The larger V4 Pro variant set the ceiling with 80.6% on SWE-bench Verified. But it is Flash that broke through, because it captures most of that capability at a price that sits on the Pareto frontier of performance and cost.

Artificial Analysis has GLM 5.2 as the number-one open-weight model on its Intelligence Index at 51, ahead of Nemotron 3 Ultra (48), MiniMax M3 (44), DeepSeek V4 Pro (44), and Kimi K2.6 (43), and just five points below Claude Fable 5.

The gap to the closed frontier is real but narrow, and it has not been widening.

A tool like Beagle - connecting Slack conversations to your workflows - increasingly sits at the top of pipelines where the model-routing decision below it is consequential. Which model handles the triage pass versus which handles the synthesis step isn't a one-time call. It's a workload map you revisit quarterly, because the models your team chose in January are not the same as the ones available now.

The engineer who swapped to V4 Flash in that Tuesday Slack thread didn't close a philosophical debate about open vs. closed. They made one workload decision with the numbers in front of them. That's the right unit of analysis.

What the benchmark gap actually looks like right now

GLM 5.2, released in mid-June, is breaking through on planning quality and long-horizon coding specifically

Where the cost math actually flips

The decision is per-workload, not per-company

What mid-June's model landscape actually looks like

Keep reading

Run GLM 5.2 Before You Renew That Claude Contract

Understand the Nous Research Hermes Open-Weight Model Line

Pick an Open-Weight Model Based on What the Gap Data Shows