Four months. That is the average lag between the best open-weight models and the closed-source frontier since January 2026, according to Epoch AI's capability analysis published May 29. On the Epoch Capabilities Index (ECI), that lag works out to an 8-point gap, roughly equivalent to the difference between GPT-5 and GPT-5.5. That is not a large gap. The mistake most teams make is either ignoring it entirely or treating it as an absolute veto on open weights. Neither is right.
Four open-weight models released between February and April 2026 now score 50 or above on the Artificial Analysis Intelligence Index. A year ago, the best open-weight models were scoring in the low 30s. The capability story has changed faster than most teams have updated their infrastructure decisions.
So the question worth answering precisely is: where does the gap actually bite you, and where is it noise?
Where the closed-model edge is still real
The gap concentrates in a specific kind of work. It shows up in long-horizon agentic reasoning - tasks that run twenty, thirty, or more dependent steps, where an early mistake has to be caught and corrected rather than assumed away. On those workloads the frontier closed models still hold a measurable lead, keeping coherence and backtracking more reliably across a long chain.
Reasoning-heavy benchmarks like GPQA Diamond, Humanity's Last Exam, and frontier math still show closed models - Claude Opus 4.6, GPT-5.4 Pro, and Gemini 3.1 Pro Deep Think - retaining a meaningful lead, typically by 3-8 percentage points.
There is also a subtler issue worth knowing: two factors mean the measured gap may understate the true one. First, evidence suggests open-weight models tend to perform worse on private benchmarks than closed models do, plausibly because they hill-climb more aggressively on public benchmarks. Second, Epoch notes that closed labs likely keep their most advanced models private before release, so the most capable closed model publicly available may not be the most capable model in development. Open-weight models are compared against the published frontier. The gap to the unpublished frontier may be wider.
That is not a reason to dismiss open weights. It is a reason to be honest about what you are comparing against.
Where open-weight models are genuinely good enough
For classification, extraction, summarization, retrieval-augmented question answering, and most coding, a good open-weight model in 2026 is not a compromise. It does the job.
The coding gap has effectively closed. MiMo V2 Pro, MiniMax M2.7, and DeepSeek V3.2 now sit within striking distance of Opus 4.6 on real-world coding workloads, with MiniMax M2.7 specifically costing roughly 50x less per million output tokens.
The cost differential is not a rounding error. Open models close 70-90% of the capability gap at 5-10× lower per-token inference cost. At any real usage volume, that arithmetic changes the decision entirely for workloads where the best model is not required.
The right architecture in 2026 routes by task: closed-source for the fraction of requests where frontier capability is actually necessary, open-weight for the rest.
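The arithmetic behind routing by task can be sketched in a few lines. This is a minimal illustration, not a pricing model: the two prices are hypothetical placeholders (the 10x ratio is the upper end of the 5-10× range quoted above), and `blended_cost_per_million` is a name invented for this example.

```python
def blended_cost_per_million(closed_price, open_price, closed_fraction):
    """Average price per million tokens when `closed_fraction` of traffic
    is routed to the closed model and the rest to the open-weight model."""
    if not 0.0 <= closed_fraction <= 1.0:
        raise ValueError("closed_fraction must be between 0 and 1")
    return closed_fraction * closed_price + (1.0 - closed_fraction) * open_price


# Hypothetical prices in dollars per million output tokens.
# These are assumptions for illustration, not vendor quotes.
CLOSED_PRICE = 10.0
OPEN_PRICE = 1.0  # 10x cheaper than the closed model

for fraction in (1.0, 0.5, 0.2, 0.1):
    cost = blended_cost_per_million(CLOSED_PRICE, OPEN_PRICE, fraction)
    savings = 1.0 - cost / CLOSED_PRICE
    print(f"{fraction:.0%} closed: ${cost:.2f} per million tokens, {savings:.0%} saved")
```

Under these assumed prices, sending 20% of traffic to the closed model costs about $2.80 per million tokens against $10.00 for an all-closed setup. The saving depends far more on how small the frontier-only fraction is than on the exact price ratio.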
The 2026 strategic question is rarely "one or the other" - most production AI deployments use closed models for general-purpose tasks and fine-tuned open models for cost-sensitive or domain-specific workflows.
The current open-weight tier worth knowing about
The field has consolidated around a small number of genuinely frontier-adjacent open-weight models.
GLM-5.2 from Z.ai scores 51 on the Artificial Analysis Intelligence Index - a jump of 11 points over GLM-5.1 - and leads the open-weights pack by a seven-point margin over the nearest competitors. The jump from 40 to 51 is the largest single-generation improvement on the Index among open models.
The post-training recipe involves reinforcement learning specifically designed for long-horizon agentic tasks - code editing, tool use, multi-step problem solving - with what Z.ai calls "anti-hacking" to prevent reward hacking during RL training. This is a notable departure from the standard instruction-tuning plus RLHF pipeline.
Kimi K2.6 from Moonshot AI is one of the strongest open-weight models for developers. The Hugging Face model card lists it under a Modified MIT license with roughly 1.1T parameters. It is especially strong for coding, tool use, and long-horizon agent workflows.
In a live coding competition on May 3, K2.6 placed first among eight frontier models, ahead of GPT-5.5 and Claude Opus 4.7.
DeepSeek shipped V4 Pro and V4 Flash on April 24, 2026, both MIT licensed with a native 1-million-token context window. V4 Pro is a 1.6-trillion-parameter MoE with 49B active per forward pass; V4 Flash is a leaner 284B with 13B active, built for high-volume, cost-sensitive work.
The self-hosting calculation
For privacy-sensitive teams - legal, healthcare, anyone under GDPR or HIPAA - the open-weight question is inseparable from the self-hosting question. Sending prompts to a hosted API means transferring data to a third party. For healthcare providers under HIPAA, legal teams handling client matter data, and financial institutions under GDPR, that transfer is a structural problem that contractual data processing agreements only partially address. Self-hosting eliminates it entirely: the model weights run on hardware the deploying organization controls, with no outbound connections after the initial model download.
Hardware has become less of an obstacle. NVIDIA's Nemotron 3 Super is a 120B-parameter Mixture-of-Experts model that activates only roughly 12B parameters per token. That architectural choice means you get the knowledge capacity of a 120B model with the inference speed of a much smaller one - and it fits within the DGX Spark's 128GB of unified memory, or on a 192GB Mac Studio.
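A rough way to sanity-check a memory claim like this is to estimate the size of the weights alone. The sketch below assumes all 120B parameters must be resident in memory (sparse activation reduces compute per token, not storage) and reserves a hypothetical 16GB for the KV cache, activations, and the operating system. The function names and the reserve figure are illustrative assumptions, not measurements.

```python
def weight_memory_gb(total_params_billions, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes).
    Billions of parameters times bytes per parameter gives GB directly."""
    return total_params_billions * bits_per_weight / 8


def fits_in_memory(weights_gb, memory_gb, reserve_gb):
    """True if the weights plus a fixed reserve fit in the memory budget."""
    return weights_gb + reserve_gb <= memory_gb


TOTAL_PARAMS_B = 120  # total parameter count, from the text
RESERVE_GB = 16       # hypothetical headroom for KV cache, activations, OS

for bits in (16, 8, 4):
    weights = weight_memory_gb(TOTAL_PARAMS_B, bits)
    for memory in (128, 192):
        verdict = "fits" if fits_in_memory(weights, memory, RESERVE_GB) else "does not fit"
        print(f"{bits}-bit weights ({weights:.0f} GB): {verdict} in {memory} GB")
```

Under these assumptions, 16-bit weights (240GB) fit on neither machine, 8-bit weights (120GB) fit only in 192GB, and 4-bit weights (60GB) fit in both. In practice, which quantization is acceptable depends on how much quality loss the workload tolerates.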
The operational cost is real, though. A poorly secured self-hosted environment can be more vulnerable than a well-managed cloud platform. Data sovereignty and data security are related but distinct: self-hosting gives you sovereignty; security is your responsibility. A team that treats self-hosting as a compliance checkbox without hardening the inference endpoint has not actually solved the problem.
What the gap number does not tell you
The Epoch Capabilities Index measures capability ceilings, not operational readiness. For most enterprise teams, the infrastructure gap - observability tooling, security hardening, fine-tuning pipelines - is the actual binding constraint, not the capability ceiling. Faster capability growth only widens that infrastructure gap.
A model that scores 51 on a benchmark aggregate and one that scores 57 may produce identical results on your actual workload. The benchmark is measuring something real, but not necessarily the thing your team runs all day. Before routing any task to a model, the decision should come down to three questions: Does this prompt contain data I cannot send to a third party? Is this a multi-step agentic chain where reliability across 30+ tool calls matters? And is the per-token cost meaningful at our actual volume?
If the answer to the first is yes, the open-weight stack on your own infrastructure is the correct answer regardless of benchmark position. If the answer to the second is yes and you need sustained coherence across a long chain, the four-month closed-model lead is probably real enough to matter. For everything else - summarization, classification, drafting, RAG, routine coding - open-weight AI in 2026 is no longer just the cheaper alternative. Even for coding, reasoning, agentic workflows, and long-context analysis, open-weight models are now good enough for serious production use.
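The decision rule above can be sketched as a small routing function. This is one possible reading of the first two questions, not a reference implementation: the `Task` fields, the `Route` names, and the use of 30 tool calls as a hard threshold are assumptions made for illustration. The third question, cost at volume, is left out because the text does not say how its answer should change the route.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    SELF_HOSTED_OPEN = "self-hosted open-weight"
    CLOSED_FRONTIER = "closed frontier"
    OPEN_WEIGHT = "open-weight"


@dataclass
class Task:
    contains_restricted_data: bool  # data that cannot go to a third party
    expected_tool_calls: int        # rough length of the agentic chain


# Hypothetical cutoff taken from the "30+ tool calls" wording above.
LONG_CHAIN_THRESHOLD = 30


def choose_route(task):
    """Apply the questions in order; the privacy check always wins."""
    if task.contains_restricted_data:
        return Route.SELF_HOSTED_OPEN
    if task.expected_tool_calls >= LONG_CHAIN_THRESHOLD:
        return Route.CLOSED_FRONTIER
    return Route.OPEN_WEIGHT


print(choose_route(Task(contains_restricted_data=True, expected_tool_calls=50)).value)
print(choose_route(Task(contains_restricted_data=False, expected_tool_calls=40)).value)
print(choose_route(Task(contains_restricted_data=False, expected_tool_calls=3)).value)
```

Checking the privacy question first mirrors the text: a long agentic chain over restricted data still stays on your own infrastructure, even though a closed model might handle the chain more reliably.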
The gap is four months. Know which tasks live inside it.