Uber's CTO said it plainly in a post-mortem this spring: "I'm back to the drawing board, because the budget I thought I would need is blown away already." Claude Code adoption had spread fast - from 32% to 84% of the company's 5,000-engineer org between December 2025 and March 2026, and by April the entire annual AI budget was gone, with monthly API costs per engineer running between $500 and $2,000.
This isn't a story about a company being careless. It's a story about a cost model that looks sane until agents enter the picture - and then breaks in a specific, predictable way that most teams only discover once the bill arrives.
The 1,000× price drop that didn't save anyone money
In late 2022, running a GPT-4-class model cost approximately $20 per million tokens. In early 2026, equivalent performance costs $0.40 per million tokens or less - a 1,000× reduction in just over three years, one of the fastest cost declines in computing history.
That number is real. So is this one: per-token prices have fallen roughly 80% since mid-2023, yet total enterprise AI spending grew 483% from 2024 to 2026. The two facts coexist because the unit of cost changed without anyone updating their budget model.
The shift happened because enterprises moved from experimental chatbots to production-scale agentic AI deployments - and agentic AI consumes tokens in ways that no traditional budget model anticipated. A chatbot sends a message and gets a reply. An agent runs a reasoning loop: it plans, calls a tool, reads the result, calls another tool, checks its own output, and retries. It sends the full accumulated context - including system prompt and conversation history - to the model at every step. By step 20 of a multi-step task, the agent is paying for the same context 20 times over.
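A minimal sketch of that arithmetic, assuming a 2,000-token system prompt and 1,500 tokens of tool output and model reply appended per step. Both figures are illustrative, not measurements, and `cumulative_input_tokens` is a hypothetical helper name.

```python
def cumulative_input_tokens(system_prompt: int, added_per_step: int, steps: int) -> int:
    """Total input tokens billed when every step resends the full context."""
    total = 0
    context = system_prompt
    for _ in range(steps):
        total += context           # the whole accumulated context is sent again
        context += added_per_step  # tool result and model reply are appended
    return total


# Illustrative assumptions: 2,000-token system prompt, 1,500 tokens added per step.
one_step = cumulative_input_tokens(2_000, 1_500, 1)
twenty_steps = cumulative_input_tokens(2_000, 1_500, 20)

print(one_step)      # 2000
print(twenty_steps)  # 325000
```

Under these assumptions, step count grows 20× but billed input grows more than 160×, because each step pays again for everything appended before it.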
A recent arXiv paper analyzing agentic coding tasks found that they consume roughly 1,000× more tokens than code reasoning or code chat, with input tokens rather than output tokens driving the overall cost. Token usage is also highly variable: runs on the same task can differ by up to 30× in total tokens, and higher token usage does not translate into higher accuracy.
That last finding is the one worth sitting with. Your agent spending more tokens is not the same as your agent doing better work.
What Gartner's 5-30× figure actually means for a team
Gartner's March 2026 analysis found that agentic AI models require 5-30× more tokens per task than standard chatbots, because a reasoning agent doesn't just send a prompt and receive a completion: it loops through planning, tool calls, and self-checks, resending its context at each step. The multiplier is not a fixed number - it grows with task length, tool count, and how much the agent retries.
Simple tool-calling agents use 5,000-15,000 tokens per task, while complex multi-agent systems can consume 200,000 to over 1,000,000 tokens per task. Agentic coding workflows average 1-3.5 million tokens per task including retries.
Put that against a real team. Six different clients in 2026 followed the same pattern: an engineering team enables AI coding agents, sets up API keys, and within 90 days the AI bill is the second-largest line item on the engineering ledger after salaries. One client had a single developer hit $4,200 in API fees over a long weekend during an autonomous refactoring run - one developer, three days, on a workload the team had not even validated.
One contributing factor that's easy to miss: MCP tool metadata can consume 40-50% of context windows. If your agent loads a broad set of tools at every step - which most frameworks do by default - you're paying for tool descriptions you're not using.
The output token asymmetry compounds things further. Claude Sonnet 4 charges $3.00 per million input tokens and $15.00 per million output tokens - a 5× spread. An agent that generates long intermediate reasoning traces at every step pays output prices for work the user never sees.
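As a rough illustration of the spread, the sketch below prices a single agent step using the two per-million rates quoted above. The token counts and the `step_cost` helper are hypothetical.

```python
INPUT_PRICE_PER_M = 3.00    # USD per million input tokens (rate quoted above)
OUTPUT_PRICE_PER_M = 15.00  # USD per million output tokens (rate quoted above)


def step_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one model call at the rates above."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


# Hypothetical step: 10,000 tokens of context in, 2,000 tokens of reasoning out.
input_only = step_cost(10_000, 0)
output_only = step_cost(0, 2_000)

print(f"{input_only:.2f}")                   # 0.03
print(f"{output_only:.2f}")                  # 0.03
print(f"{step_cost(10_000, 2_000):.2f}")     # 0.06
```

In this example, 2,000 output tokens cost as much as 10,000 input tokens, so a verbose intermediate trace can double the price of a step.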
Reasoning models can consume 100× more tokens internally than they output, creating a cost paradox where cheaper per-token pricing leads to higher total bills.
Three levers that actually move the number
The temptation when bills spike is to pick a cheaper model. That helps at the margin, but it rarely addresses the structural problem. Instead of asking which provider is cheapest, teams should ask which tasks deserve expensive inference - and build a model-routing layer with explicit thresholds rather than sending every request to the best model.
The three changes that consistently show up in post-mortems are:
Context trimming. Long context windows create the temptation to over-prompt, leading to bloated input sizes and unnecessary cost. Retrieval systems, in particular, often pass 10,000+ tokens into a model because they can, not because it's effective. Narrowing what the agent sees per step - summarizing history rather than appending it - is where most of the savings come from.
Model routing. For multi-agent systems, a hierarchical architecture using budget models for worker agents and frontier models only for the lead orchestrator can achieve 97.7% of full-frontier accuracy at roughly 61% of the cost. A teammate like Beagle applies this pattern directly inside Slack and Teams - routing routine lookups to smaller, faster models and escalating only when the task requires it.
Prompt caching. Caching lets the provider reuse the already-processed form of static components like system prompts or repeated context blocks instead of reprocessing them on every call. If your system prompt is 2,000 tokens and you're running 10,000 agent steps a day, caching that prompt is meaningful money. Batching is a separate lever for work that doesn't need an immediate answer: OpenAI's batch endpoint offers bulk inference at roughly 50% of real-time token costs.
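The first two levers can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: `Step`, `trim_history`, `route_model`, the model tier names, and the 0.7 threshold are all hypothetical, and the summary is a placeholder where a real system would spend tokens on a summarization call.

```python
from dataclasses import dataclass


@dataclass
class Step:
    role: str
    text: str
    tokens: int


def trim_history(history: list[Step], keep_last: int, summary_tokens: int) -> list[Step]:
    """Keep the most recent steps verbatim; collapse older ones into one summary entry."""
    split = max(len(history) - keep_last, 0)
    older, recent = history[:split], history[split:]
    if not older:
        return list(recent)
    # Placeholder summary: a real system would generate this with a model call.
    summary = Step("summary", f"[summary of {len(older)} earlier steps]", summary_tokens)
    return [summary, *recent]


def route_model(difficulty: float, escalate_at: float = 0.7) -> str:
    """Send a task to the expensive tier only above an explicit threshold."""
    return "frontier-model" if difficulty >= escalate_at else "budget-model"


history = [Step("tool", f"result {i}", 500) for i in range(10)]
trimmed = trim_history(history, keep_last=3, summary_tokens=200)

print(sum(s.tokens for s in history))  # 5000
print(sum(s.tokens for s in trimmed))  # 1700
print(route_model(0.2))                # budget-model
print(route_model(0.9))                # frontier-model
```

How `difficulty` is scored is left open here; the point is only that the threshold is explicit and reviewable rather than implicit in "send everything to the best model".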
Once repeated context can be reused, the expensive part of a request is no longer "all those tokens" - it's the fresh, uncached portion and the output path. That pushes the right unit of analysis toward dollars per successful workflow step, not dollars per raw token count.
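To put a rough number on the system prompt example, the sketch below prices 2,000 tokens resent across 10,000 steps a day at the $3.00 per million input rate quoted earlier. The cached-read multiplier of 0.1 is an assumption for illustration; actual discounts and any cache-write surcharges vary by provider and are not modeled here.

```python
def daily_prompt_cost(prompt_tokens: int, steps_per_day: int,
                      price_per_m: float, price_multiplier: float = 1.0) -> float:
    """Daily dollar cost of resending a static prompt on every step."""
    return prompt_tokens * steps_per_day * price_per_m * price_multiplier / 1_000_000


PROMPT_TOKENS = 2_000      # from the example above
STEPS_PER_DAY = 10_000     # from the example above
INPUT_PRICE_PER_M = 3.00   # USD per million input tokens, quoted earlier
CACHED_MULTIPLIER = 0.1    # ASSUMPTION: cached reads billed at 10% of the input rate

uncached = daily_prompt_cost(PROMPT_TOKENS, STEPS_PER_DAY, INPUT_PRICE_PER_M)
cached = daily_prompt_cost(PROMPT_TOKENS, STEPS_PER_DAY, INPUT_PRICE_PER_M, CACHED_MULTIPLIER)

print(f"{uncached:.2f}")  # 60.00
print(f"{cached:.2f}")    # 6.00
```

Under these assumptions the static prompt alone goes from about $60 a day to about $6 a day, before counting any other repeated context.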
The metric you should be tracking instead
Cost per useful output - not cost per token, but cost per completed task - normalizes across models of different sizes and pricing structures. A customer service resolution, a generated report, a qualified lead. Tracking cost-per-token in isolation is like tracking cost-per-line-of-code: it measures activity, not value.
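A minimal sketch of the metric, assuming you log the dollar cost and the outcome of each agent run. The `Run` record and the sample figures are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    cost_usd: float
    succeeded: bool


def cost_per_completed_task(runs: list[Run]) -> Optional[float]:
    """Total spend, including failed runs, divided by the number of completed tasks."""
    completed = sum(1 for r in runs if r.succeeded)
    if completed == 0:
        return None  # nothing completed, so the metric is undefined
    return sum(r.cost_usd for r in runs) / completed


# Hypothetical log: four runs, one of which failed after spending $1.20.
runs = [Run(0.40, True), Run(1.20, False), Run(0.30, True), Run(2.10, True)]

average_per_run = sum(r.cost_usd for r in runs) / len(runs)
per_completed = cost_per_completed_task(runs)

print(f"{average_per_run:.2f}")  # 1.00
print(f"{per_completed:.2f}")    # 1.33
```

The failed run still shows up in the numerator, which is the point: retries and abandoned attempts are part of what a completed task costs.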
In June 2026, OpenAI CEO Sam Altman told CNBC that questions about whether AI spending will ever produce returns are "the most fair criticism right now of AI." He acknowledged that customers are telling him they have burned through their entire 2026 AI budget already, and that cost concerns went from never coming up to the second-most common issue he hears, in a matter of months.
When the person selling you the product says the ROI question is fair, it deserves a serious answer before you scale further.
The arithmetic works in your favor if you build for it. For AI companies, cost-per-token has replaced FLOPS as the metric that determines business viability: if your inference cost is $1.00 per million tokens and you charge $2.00, your gross margin is 50%. If inference costs drop to $0.40, the same pricing yields 80% margin - or you can cut prices to grow users. The collapse in token prices is genuinely good news. But only if you don't let agents quietly undo it by burning your context budget twenty steps at a time.