The Gap Between Pilot Tokens and Production Tokens

Token prices have fallen by 99% since 2022. Enterprise AI bills are still rising. The reason is not the price per token - it is how many times an agent charges that price to finish one task.

The number on the pricing page went down. What went up is how many times it gets charged per task.

LLM inference costs have dropped roughly 10x per year since 2021 - GPT-4-level performance now runs around $0.40 per million tokens, versus $30 or more in early 2023. That is the headline everyone has heard. The part heard less often is that total spend has moved the other way: while headlines celebrate plummeting token prices, AI companies - and the enterprises building on them - are discovering that their bills keep climbing.

The cause is not waste or mismanagement. It is structural.


A chatbot uses one inference call per question. An agentic workflow - one that calls external tools, verifies outputs, and self-corrects - can trigger 10 to 20 model calls for a single user-initiated task. That is why bills keep rising even though the per-token cost of intelligence has dropped 98% since early 2024. It also changes the relevant unit: it is no longer cost per prompt, but cost per completed task.
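The arithmetic behind that shift can be sketched in a few lines. This is an illustration rather than a billing model: the 2,000 tokens per call is an assumed figure, the function name is made up for the example, and the $0.40 price is the one quoted above.

```python
def cost_per_task(calls_per_task: int, tokens_per_call: int, price_per_million: float) -> float:
    """Dollar cost of one completed task: calls x tokens per call x price per token."""
    return calls_per_task * tokens_per_call * price_per_million / 1_000_000


PRICE_PER_MILLION = 0.40  # dollars per million tokens, the figure quoted above
TOKENS_PER_CALL = 2_000   # assumption for illustration only

# One call for a chatbot question, 10 to 20 calls for an agentic task.
chatbot = cost_per_task(1, TOKENS_PER_CALL, PRICE_PER_MILLION)
agent_low = cost_per_task(10, TOKENS_PER_CALL, PRICE_PER_MILLION)
agent_high = cost_per_task(20, TOKENS_PER_CALL, PRICE_PER_MILLION)

print(f"chatbot: ${chatbot:.4f} per task")
print(f"agent:   ${agent_low:.4f} to ${agent_high:.4f} per task")
```

The per-token price is identical in all three lines. Only the call count changes, and the cost per task scales with it.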

Gartner's March 2026 analysis found that agentic models require between 5 and 30 times more tokens per task than a standard chatbot. Enterprises that scaled past the pilot phase discovered this multiplier only after their production bills arrived. The pilot economics bore no relationship to the production economics of multi-step agentic loops running thousands of times per day.

EY put a dollar figure on it. In 2023, a simple linear workflow - input, retrieval, response - cost roughly $0.04 per interaction. In 2026, a more complex orchestrated system involving tools, reasoning, and iterative loops runs about $1.20 per interaction: around 30 times higher.

The Uber case made the rounds for a reason. Claude Code adoption jumped from 32% to 84% of Uber's 5,000-engineer organization between December 2025 and March 2026. By April, the entire annual AI budget was gone. Monthly API costs per engineer were running between $500 and $2,000. That is not a failure of the tool. It is a failure of the cost model that approved the rollout.


There are three things piling up inside any agentic task that the pricing page does not make obvious.

The first is the loop itself. Agentic systems make multiple LLM calls per user request, each carrying tool definitions and chain-of-thought reasoning, repeated across iterative loops. That is where the 5-30x token multiplier over a standard chat interaction comes from.

The second is the context reload. RAG is the industry-standard architecture for grounding AI in company-specific documents. But RAG introduces what practitioners call a 'context tax': large amounts of documentation are sent to the model with every query, dramatically inflating the token count per inference call.

The third is everything outside the model invoice. Per-token prices have fallen roughly 99.7% since GPT-3-era rates, yet enterprise AI bills tripled over the same period. Agentic workflows multiply token usage per task, and 72% of production AI cost sits outside the model invoice in orchestration, retrieval, retries, and observability.
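The loop and the context reload compound each other, and a rough sketch shows how. It assumes every step resends a fixed context (system prompt, tool definitions, retrieved documents) plus everything earlier steps produced, and it counts input tokens only. The 3,000 and 500 token figures are invented for illustration.

```python
def input_tokens_for_task(steps: int, fixed_context: int, new_tokens_per_step: int) -> int:
    """Total input tokens billed across a loop that resends its context every step."""
    total = 0
    history = 0
    for _ in range(steps):
        total += fixed_context + history  # what this call sends to the model
        history += new_tokens_per_step    # output and tool results carried forward
    return total


FIXED_CONTEXT = 3_000        # assumed: prompt + tool definitions + retrieved docs
NEW_TOKENS_PER_STEP = 500    # assumed: tokens each step adds to the history

single_call = input_tokens_for_task(1, FIXED_CONTEXT, NEW_TOKENS_PER_STEP)
ten_step_loop = input_tokens_for_task(10, FIXED_CONTEXT, NEW_TOKENS_PER_STEP)

print(single_call, ten_step_loop, ten_step_loop / single_call)
```

Under these assumptions a ten-step loop sends 17.5 times the tokens of a single call, not 10 times, because the history grows while the fixed context is paid for again on every step.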

None of these are hidden. They are just easy to miss when you are running a pilot at low volume and paying $0.40 per million tokens.


The fix is not to slow down on agents. The fix is to measure differently from day one.

The teams winning in 2026 are not the ones with the most sophisticated models. They are the ones that measured inference cost on day one, budgeted for a 5-25x agentic multiplier versus chat, and built with constraints in mind.

A few specific moves are worth naming.

Model routing. Classification, extraction, intent detection, document summarization, and the routine logic that makes up the majority of most enterprise agentic workflows do not require frontier capability. The price differential is not marginal - frontier models run 20 to 50 times the per-token price of budget ones. Route the easy steps to smaller, cheaper models and reserve frontier models for the complex ones; that split can cut costs by 60-90%. Production data suggests that approximately 85% of enterprise queries can be handled by budget-tier models.
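A minimal routing sketch makes the savings concrete. Everything specific here is an assumption for illustration: the tier names, the step labels, the $12.00 frontier price (chosen to sit at a 30x differential), and the rule that step type alone decides the tier. A production router would more likely classify each request than look up a fixed set.

```python
# Hypothetical prices in dollars per million tokens; 12.00 / 0.40 is a 30x differential.
PRICE_PER_MILLION = {"budget": 0.40, "frontier": 12.00}

# Hypothetical labels for steps that do not need frontier capability.
SIMPLE_STEPS = {"classification", "extraction", "intent_detection", "summarization"}


def pick_tier(step_type: str) -> str:
    """Send routine steps to the budget tier and everything else to frontier."""
    return "budget" if step_type in SIMPLE_STEPS else "frontier"


def workload_cost(steps: list[tuple[str, int]], route: bool) -> float:
    """Dollar cost of a list of (step_type, tokens), with or without routing."""
    total = 0.0
    for step_type, tokens in steps:
        tier = pick_tier(step_type) if route else "frontier"
        total += tokens * PRICE_PER_MILLION[tier] / 1_000_000
    return total


# 85 routine steps and 15 complex ones, 2,000 tokens each (assumed).
workload = [("classification", 2_000)] * 85 + [("planning", 2_000)] * 15

unrouted = workload_cost(workload, route=False)
routed = workload_cost(workload, route=True)
savings = 1 - routed / unrouted

print(f"unrouted ${unrouted:.3f}, routed ${routed:.3f}, savings {savings:.0%}")
```

With an 85/15 split and a 30x price gap, routing saves roughly 82% in this toy workload. The 15% of steps that stay on the frontier tier account for most of what remains.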

Semantic caching. Pairing model routing with semantic caching reduces API call volume by 30-50% for typical enterprise deployments. For agents that answer similar questions repeatedly - think internal knowledge retrieval or support triage - this compounds fast.
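The idea can be sketched with the standard library alone. One loud caveat: the `embed` function below is a toy word-count stand-in, not a real embedding model, and the 0.8 threshold is arbitrary. A real semantic cache would use learned embeddings and a vector index; the class and function names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []
        self.hits = 0
        self.misses = 0

    def get_or_call(self, query: str, call_model) -> str:
        """Return a cached answer for a similar-enough query, else call the model."""
        vec = embed(query)
        for cached_vec, answer in self.entries:
            if cosine(vec, cached_vec) >= self.threshold:
                self.hits += 1
                return answer
        self.misses += 1
        answer = call_model(query)
        self.entries.append((vec, answer))
        return answer


def fake_model(query: str) -> str:
    """Placeholder for a paid model call."""
    return f"answer to: {query}"


cache = SemanticCache()
first = cache.get_or_call("how do I reset my vpn password", fake_model)
second = cache.get_or_call("how do I reset my vpn password please", fake_model)
third = cache.get_or_call("what is the holiday schedule", fake_model)

print(cache.hits, cache.misses)
```

The second query is a near-duplicate of the first, so it is served from the cache without a model call. The third shares no words with either and goes to the model. The threshold is the knob that matters: set it too low and users get answers to questions they did not ask.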

Task-level cost accounting. An AI bill showing total monthly spend is not useful. You need cost per completed task, broken out by workflow type. That is the number that tells you whether a given agent is earning its keep. A teammate like Beagle, answering questions inside Slack all day, can have its per-answer cost measured directly against the alternative - someone looking it up manually. Most workflows can be measured the same way if you instrument at the task boundary rather than the token boundary.
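Instrumenting at the task boundary can be as small as a ledger that records every model call against a workflow and divides by the tasks that actually finished. This is a sketch under assumptions: the class name, the workflow label, and the single flat price are all hypothetical, and real accounting would also need the non-model costs.

```python
from collections import defaultdict


class TaskLedger:
    """Accumulates model spend per workflow and reports cost per completed task."""

    def __init__(self, price_per_million: float):
        self.price_per_million = price_per_million
        self.cost = defaultdict(float)     # workflow -> dollars spent
        self.completed = defaultdict(int)  # workflow -> tasks finished

    def record_call(self, workflow: str, tokens: int) -> None:
        """Log one model call, whether or not its task ends up completing."""
        self.cost[workflow] += tokens * self.price_per_million / 1_000_000

    def complete_task(self, workflow: str) -> None:
        self.completed[workflow] += 1

    def cost_per_completed_task(self, workflow: str) -> float:
        done = self.completed[workflow]
        if done == 0:
            raise ValueError(f"no completed tasks recorded for {workflow!r}")
        return self.cost[workflow] / done


ledger = TaskLedger(price_per_million=0.40)

# Three tasks started, 4 calls of 2,500 tokens each (assumed); one was abandoned.
for _ in range(12):
    ledger.record_call("support_triage", 2_500)
ledger.complete_task("support_triage")
ledger.complete_task("support_triage")

per_task = ledger.cost_per_completed_task("support_triage")
print(f"${per_task:.4f} per completed support_triage task")
```

The abandoned task still counts toward spend but not toward the denominator. That is deliberate: retries and failures are part of what a completed task costs.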


The strategic implication is that the competitive moat in AI agent systems will not ultimately be access to cheap inference, but the quality of agent architecture, memory systems, tool integrations, and organizational knowledge embedded in agent behavior. Infrastructure cost will cease to be a differentiator; the quality of what agents do with that compute will be the remaining axis of competition.

That is the right frame for planning, but it does not help the team whose Q3 budget is already overrun. The practical task for right now is simpler: stop pricing agents like chatbots. Measure at the task, route by complexity, and find out before production what the production number actually is.