What Does a 1-Million-Token Open-Weight Model Actually Cost You?

A week ago, Shanghai-based MiniMax released M3, an open-weight model that combines three things no open model has offered together before: frontier coding-agent performance, a one-million-token context window, and native multimodal capabilities, including image, video, and desktop computer operation. The timing is good for anyone tired of paying proprietary prices. But the launch comes loaded with asterisks, and they matter before you route production traffic through it.

What is genuinely new here

The headline is not the benchmark sheet. It is the engine underneath.

The technical foundation is a new attention variant called MiniMax Sparse Attention (MSA). Classic full attention compares every token against every other token, so compute costs grow quadratically with input length. MSA avoids this by calculating attention scores only for selected segments rather than every token pair.
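
To make the quadratic growth concrete, here is a minimal counting sketch. It counts only how many query-key pairs get scored. The `budget` parameter is a hypothetical cap on keys per query, not a published MSA setting, and the count ignores the cost of choosing which keys to keep.

```python
def full_causal_pairs(n_tokens):
    """Query-key pairs scored by full causal attention over n_tokens."""
    # The query at position p (1-based) sees p keys, so the total is 1 + 2 + ... + n.
    return n_tokens * (n_tokens + 1) // 2


def budgeted_pairs(n_tokens, budget):
    """Pairs scored if each query attends to at most `budget` keys.

    `budget` is a hypothetical cap used for illustration only.
    """
    if n_tokens <= budget:
        return full_causal_pairs(n_tokens)
    # The first `budget` queries see everything before them; the rest see `budget` keys each.
    return full_causal_pairs(budget) + (n_tokens - budget) * budget
```

Under full attention, ten times the input means roughly a hundred times the pairs. With a fixed per-query budget, the count grows roughly linearly once the input exceeds the budget.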

The key-value cache gets split into blocks. A preliminary filtering step decides which blocks are actually relevant to the current query. Only those blocks go into the full calculation, giving M3 a one-million-token context window.
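
The block-filtering idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not MiniMax's implementation: the article does not say how MSA scores blocks, so the mean-key filter below is an assumption, and `block_size` and `top_k` are hypothetical parameters.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def block_sparse_attention(q, keys, values, block_size, top_k):
    """Attend from one query to only the top_k highest-scoring KV blocks."""
    d = len(q)
    # Split token positions into contiguous blocks.
    blocks = [list(range(s, min(s + block_size, len(keys))))
              for s in range(0, len(keys), block_size)]

    # Cheap filter (an assumption): score a block by q dotted with its mean key.
    def block_score(idx):
        mean_key = [sum(keys[i][j] for i in idx) / len(idx) for j in range(d)]
        return dot(q, mean_key)

    ranked = sorted(range(len(blocks)),
                    key=lambda b: block_score(blocks[b]), reverse=True)
    selected = sorted(ranked[:top_k])

    # Exact softmax attention over the uncompressed keys/values of the selected blocks.
    positions = [i for b in selected for i in blocks[b]]
    weights = softmax([dot(q, keys[i]) / math.sqrt(d) for i in positions])
    out = [sum(w * values[i][j] for w, i in zip(weights, positions))
           for j in range(len(values[0]))]
    return out, selected
```

When `top_k` covers every block, the function reduces to ordinary full attention. With a smaller `top_k`, the expensive step touches only `top_k * block_size` keys plus one cheap score per block.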

The practical result of that architectural shift, by MiniMax's account: MSA delivers more than 9× faster prefill and more than 15× faster decoding at 1M-token context versus M2, at 1/20th the per-token compute. If that holds up, it is not a marginal efficiency gain - it is the difference between a million-token context being a lab curiosity and being something you can afford to run on real workloads.

Unlike DeepSeek's Multi-head Latent Attention (MLA), which compresses keys and values into a low-dimensional latent space, MSA runs on a standard GQA backbone and applies block-level selection to real, uncompressed key-values.

This solves the precision loss and prefix-caching obstacles noted in the M2 paper. That distinction is worth understanding: MLA trades representation fidelity for compression savings; MSA keeps the full KV representation and instead skips the blocks its filter judges irrelevant. The keys and values it does attend to are not compressed - the savings come from skipping work.

The model itself is a Mixture-of-Experts architecture. M3 has 229.9 billion total parameters but activates just 9.8 billion per token across 256 fine-grained experts - a sparse footprint that keeps inference cheap relative to its capacity.
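
A quick sketch of what that sparsity means. The parameter counts and expert count come from the figures above. The router is a generic top-k gate, and the number of experts selected per token (`k`) is a hypothetical value, since the article does not state it.

```python
import math

TOTAL_PARAMS_B = 229.9   # total parameters in billions, from the article
ACTIVE_PARAMS_B = 9.8    # parameters active per token in billions, from the article
NUM_EXPERTS = 256        # from the article


def active_fraction(active_b, total_b):
    """Share of the model's parameters that participate in one token's forward pass."""
    return active_b / total_b


def route_top_k(gate_logits, k):
    """Generic top-k gating: pick the k experts with the highest gate logits.

    Returns (expert_index, weight) pairs, with weights softmax-normalized
    over the chosen experts only. `k` is a hypothetical setting.
    """
    chosen = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)[:k]
    m = max(gate_logits[i] for i in chosen)
    exps = [math.exp(gate_logits[i] - m) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]
```

`active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)` comes out to roughly 4.3%, which is why inference cost tracks the 9.8B figure far more closely than the 229.9B one.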

M3 is also natively multimodal - it was trained with mixed-modality data "from Step 0," per MiniMax's launch blog, rather than having vision bolted on after the fact. That matters for agentic use: a model that learned to see during pre-training integrates visual signals differently than one that had a vision adapter attached later.

The benchmark claims

MiniMax reports M3 at 59.0% on SWE-Bench Pro, 66.0% on Terminal-Bench 2.1, and 83.5 on BrowseComp, surpassing GPT-5.5 and Gemini 3.1 Pro on coding and approaching Claude Opus 4.7.

Those numbers are striking. They are also vendor-reported: several results were run on MiniMax's own infrastructure with agent scaffolding, so independent verification is still pending.

To illustrate long-horizon performance, MiniMax ran two internal demonstrations. In one, M3 independently reproduced core experiments from an ICLR 2025 paper on LLM fine-tuning over nearly 12 hours, generating 18 commits and 23 experimental figures. In another, it optimized a matrix multiplication kernel on NVIDIA Hopper GPUs over 24 hours, completing 147 benchmark submissions and 1,959 tool calls, improving peak hardware utilization from 7.6% to 71.3%. Impressive, but MiniMax itself describes these as demonstrations of long-horizon autonomous execution, not controlled benchmark evaluations.

The cost side of the equation

Pricing is where M3 gets interesting as a practical option. At launch M3 listed on OpenRouter at $0.60 per million input tokens and $2.40 per million output tokens, with a temporary 50% promotional discount bringing it to roughly $0.30 input and $1.20 output per million tokens - a fraction of frontier closed models like Claude Opus and GPT-5.5. For teams running high-volume coding agents or document pipelines that actually need the full context window, that pricing gap is real.
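
To see what those rates mean per request, here is a small calculator using the launch list prices quoted above. The example token counts in the note are hypothetical, and the discount is modeled as a simple multiplier on the list price.

```python
INPUT_USD_PER_M = 0.60    # list price per million input tokens, from the article
OUTPUT_USD_PER_M = 2.40   # list price per million output tokens, from the article


def request_cost(input_tokens, output_tokens, discount=0.0):
    """Cost in USD for one request; `discount` is a fraction, e.g. 0.5 for 50% off."""
    list_price = (input_tokens / 1_000_000) * INPUT_USD_PER_M \
        + (output_tokens / 1_000_000) * OUTPUT_USD_PER_M
    return list_price * (1.0 - discount)
```

A request that fills the full context with 1,000,000 input tokens and returns 4,000 output tokens costs about $0.61 at list price and about $0.30 with the 50% promotion. An agent that resends a long context on every turn pays that input cost each time, so multiply by the number of turns.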

What teams with compliance requirements need to know

This is the part that does not fit neatly into a benchmark table.

China's 2017 National Intelligence Law requires MiniMax to "support, assist, and cooperate" with Chinese government intelligence work. That obligation applies to every prompt processed through the company's API endpoint, regardless of where the user is located, so anything you send through the API should be treated as legally accessible to the Chinese government.

That is not a hypothetical risk or a political opinion. It is a legal structure you need to account for in your data classification decisions before any customer data, internal code, or regulated content goes through the API. A teammate like Beagle that surfaces this kind of context from your security policy docs can help teams catch the issue before a workflow is already in production.

MiniMax describes M3 as an open-weight model, but the definition matters. Open weight means the trained model parameters are made available for download and local deployment. Open source, in the stricter sense, means the training data and training code are released as well, under license terms that permit unrestricted commercial use. MiniMax has used a modified-MIT license for prior models, which is closer to open weight than to fully open source. That distinction shapes what you can actually do with the weights once they land - check the license terms before building a product on top of them.

What is incremental versus what is not

The multimodal capability is incremental; other models do this. The coding benchmark scores are competitive but need independent replication. A Chinese AI lab releasing strong open-weight models is not new either - DeepSeek V4-Pro currently leads on LiveCodeBench and Codeforces among all evaluated models, and Kimi K2.6 holds a top SWE-Bench Pro score from its April 2026 release.

What is not incremental is the MSA attention mechanism applied at this scale. Getting a million-token context window down to 1/20th of prior compute costs, on a 229B-parameter MoE model, with native multimodality trained in from the beginning - that combination has not been demonstrated in a deployable open-weight model before. The architectural bet is interesting regardless of what the benchmarks end up saying once independent researchers run them.

The practical checklist before you do anything with M3:

Wait for the weights and technical report (due around June 11). Read the license.
Run your own eval on the tasks you actually care about, not the benchmarks MiniMax selected.
Classify your data before sending anything to the API. If it cannot be exposed to possible Chinese government access, it cannot go to MiniMax's API.
If self-hosting matters for compliance, watch for GGUF quantizations once weights ship - that is the path to running M3 locally.
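
The data-classification item in the checklist above can be enforced in code rather than left to memory. A minimal sketch of such a gate follows. The sensitivity labels, destination names, and policy table are all hypothetical and should be replaced with your own data classification scheme.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    # Hypothetical labels, ordered from least to most sensitive.
    PUBLIC = 0
    INTERNAL = 1
    CUSTOMER = 2
    REGULATED = 3


# Hypothetical policy: the highest sensitivity each destination may receive.
MAX_ALLOWED = {
    "hosted_api": Sensitivity.PUBLIC,       # vendor-hosted endpoint
    "self_hosted": Sensitivity.REGULATED,   # weights running on your own hardware
}


def may_send(label, destination):
    """Return True only if data with this label is allowed to reach the destination."""
    return label <= MAX_ALLOWED[destination]
```

Calling `may_send` before every outbound request turns the policy into a hard failure instead of a guideline. An unknown destination raises `KeyError`, so new endpoints are blocked until someone classifies them.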

The architecture is worth tracking. The launch-day claims deserve skepticism. Both things are true.
