Inference is where your AI budget actually lives now

Earlier this week, Baseten finalized a $1.5 billion funding round valuing the company at up to $13 billion. Five months ago it was worth $5 billion. Nine months before that, $2.15 billion. Its annualized revenue run rate climbed from roughly $200 million to $600 million in a single quarter - growth the company attributes to an explosion in apps running open-source models continuously rather than occasionally.

That trajectory is worth pausing on. Baseten does not train models. It does not do research. It sells the unglamorous part of the AI stack: the software and computing capacity businesses need to run inference - the step where a trained model actually answers a query.

The fact that this layer can grow 3x in a quarter tells you something specific about where enterprise AI spending has shifted.

Training was the story. Inference is the bill.

In 2023, the AI cost conversation was about training. Training a large language model required hundreds of millions of dollars in compute, and only the largest labs and hyperscalers could afford it. Most enterprises simply consumed the outputs through APIs, paying a few dollars per million tokens. Inference - the cost of actually running the model - was an afterthought.

That era is over. Deloitte projected in late 2025 that inference workloads will account for roughly two-thirds of all AI compute in 2026, up from one-third just three years ago. The inference market is projected to exceed $50 billion in chip spending alone this year.

The shift happened because the way companies use AI changed: enterprises moved from experimental chatbots to production-scale agentic deployments, and agentic AI consumes tokens in ways that no traditional budget model anticipated. Gartner put a number on the gap: agentic models require between 5 and 30 times more tokens per task than a standard chatbot. Lower token unit costs will enable more advanced capabilities, but those capabilities will drive disproportionately higher token demand, so overall inference costs are expected to increase even as per-token prices fall.
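The arithmetic behind that claim is easy to sketch. The snippet below is illustrative only: the token counts and prices are made-up placeholders, and the only figure taken from the text is the 5x to 30x multiplier range.

```python
def task_cost(tokens_per_task: int, price_per_million: float) -> float:
    """Dollar cost of one task at a given price per million tokens."""
    return tokens_per_task / 1_000_000 * price_per_million


# Hypothetical baseline: one chatbot turn, 2,000 tokens at $10 per million.
chatbot = task_cost(2_000, 10.0)

# Hypothetical agentic version: the per-token price is halved, but the task
# consumes 5x to 30x the tokens (the multiplier range cited above).
agent_low = task_cost(2_000 * 5, 5.0)
agent_high = task_cost(2_000 * 30, 5.0)

print(f"chatbot task: ${chatbot:.3f}")
print(f"agentic task: ${agent_low:.3f} to ${agent_high:.3f}")
print(f"cost ratio:   {agent_low / chatbot:.1f}x to {agent_high / chatbot:.1f}x")
```

Under these assumed numbers, halving the token price still leaves each task costing 2.5x to 15x more, because the token count grows faster than the price falls.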

The model your team uses is now a smaller variable in your AI budget than how often agents call it.

The open-weight shift is real, but serving it is hard

Baseten's core thesis is that open-source models have matured to the point where many enterprise workloads no longer require proprietary offerings. CEO Tuhin Srivastava said, "Open-source models are getting very, very good," noting that customers increasingly combine open and proprietary models depending on task complexity.

The evidence is hard to argue with. In June alone, Z.ai shipped GLM 5.2 - a 744-billion-parameter model that now leads the Artificial Analysis Intelligence Index among open weights and beats GPT-5.5 on several long-horizon coding benchmarks at roughly one-sixth the price. MiniMax released M3 with a 1-million-token context window and native multimodality. DeepSeek, Moonshot, and Alibaba all have current contenders. The frontier is no longer a closed club.

But downloading weights is the easy part. Releases from Meta, Mistral, and DeepSeek have reached quality thresholds where many enterprises no longer need to pay the premium for proprietary APIs. Deploying those models efficiently at production scale is another matter: it requires custom compilation to GPU hardware, multi-cloud orchestration, traffic-based autoscaling, and low-latency request handling. That is what Baseten customers are paying for. Some have cut costs sharply by shifting workloads to open-source options, with one reportedly running a task at about 30% of the cost of a proprietary alternative.

What this means for a team building features on top of models

Teams that treat AI costs as a pricing problem will underperform teams that treat them as a systems problem. Token price matters, but the true bill depends on app design, prompt discipline, routing rules, context management, and usage patterns. Companies that learn that early can afford a broader rollout. Companies that do not will end up paying frontier-model rates for work that never needed frontier-model treatment.

In practice, that means a few concrete decisions that most teams are not making explicitly:

Which requests actually need frontier models? Routing a classification task to a smaller open-weight model instead of GPT-5.5 can cost 20-30x less for equivalent output.
How many tokens does each agent loop burn? Mapping every agent loop and identifying the token multiplier for each workflow is worth doing now. Any agentic pipeline consuming more than 10x tokens per user-initiated task needs architectural review. This single audit typically reveals 40-60% of inference waste.
Are always-on background agents really always-on? The shift from on-demand to always-on AI is the most transformative - and expensive - change in enterprise AI. Monitoring agents that scan emails, logs, and operational systems in real time consume compute continuously, even when no human is actively requesting a response. These background inference workloads were essentially absent in 2024 enterprise deployments. In 2026, they represent a growing share of the inference budget.
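The first two questions can be turned into a small audit. This is a minimal sketch, not a production router. The model tier names, the set of "simple" task types, the workflow names, and the token counts are all hypothetical. The only number taken from the text is the 10x review threshold, and measuring the multiplier against a single direct model call is an assumption.

```python
from dataclasses import dataclass

# Hypothetical model tiers; placeholder names, not real endpoints.
SMALL_MODEL = "small-open-weight"
FRONTIER_MODEL = "frontier"

# Assumption: these task types are simple enough for the small tier.
SIMPLE_TASKS = {"classification", "extraction", "tagging"}

# The 10x-per-user-task threshold mentioned above.
REVIEW_THRESHOLD = 10.0


def choose_model(task_type: str) -> str:
    """Send simple task types to the small tier, everything else to the frontier tier."""
    return SMALL_MODEL if task_type in SIMPLE_TASKS else FRONTIER_MODEL


@dataclass
class WorkflowUsage:
    name: str
    tokens_per_user_task: int  # total tokens across every agent step
    baseline_tokens: int       # tokens one direct model call would use (assumed baseline)

    @property
    def multiplier(self) -> float:
        return self.tokens_per_user_task / self.baseline_tokens


def needs_review(workflows: list[WorkflowUsage]) -> list[str]:
    """Names of workflows whose token multiplier exceeds the review threshold."""
    return [w.name for w in workflows if w.multiplier > REVIEW_THRESHOLD]


# Hypothetical measurements for two workflows.
workflows = [
    WorkflowUsage("ticket-triage", tokens_per_user_task=6_000, baseline_tokens=1_500),
    WorkflowUsage("code-review-agent", tokens_per_user_task=90_000, baseline_tokens=3_000),
]

print(choose_model("classification"))
print(choose_model("multi-step-planning"))
print(needs_review(workflows))
```

In a real system the token counts would come from logged usage rather than hard-coded values, and the routing rule would likely depend on more than the task type.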

A teammate like Beagle - living inside Slack and Teams, responding to questions from the whole company - sits squarely in this category. Every ambient trigger, every background lookup, every routed question is an inference call. How those calls are designed matters as much as which model handles them.
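A rough way to see why call design matters is to estimate what a background agent costs when no one is asking it anything. The function below is a back-of-the-envelope sketch; the trigger rate, tokens per trigger, and price are invented for illustration and do not describe any real product.

```python
def monthly_background_cost(
    triggers_per_hour: float,
    tokens_per_trigger: int,
    price_per_million: float,
    hours_per_day: float = 24.0,
    days: int = 30,
) -> float:
    """Estimated monthly dollar cost of an agent that runs without user prompts."""
    calls = triggers_per_hour * hours_per_day * days
    return calls * tokens_per_trigger / 1_000_000 * price_per_million


# Hypothetical agent: 60 triggers an hour, 1,500 tokens each, $2 per million tokens.
always_on = monthly_background_cost(60, 1_500, 2.0)

# Same agent, limited to 10 hours a day on 22 working days.
business_hours = monthly_background_cost(60, 1_500, 2.0, hours_per_day=10, days=22)

print(f"always on:      ${always_on:.2f} per month")
print(f"business hours: ${business_hours:.2f} per month")
```

With these placeholder inputs, the same agent costs about $129.60 a month running around the clock and about $39.60 when limited to working hours, before any change to the model or the prompt.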

The deeper signal from the Baseten round

The $1.8 billion that moved toward inference infrastructure in a single 48-hour window this week reflects a structural consensus forming across the investment community: base models are commoditizing faster than the industry expected, and the leverage in the AI stack is migrating to the layers around the model.

OpenAI, Google, Anthropic, and Meta are all pricing inference below cost to capture market share. When frontier providers subsidize your API calls, it creates a false floor in the market - one that will eventually normalize upward when capital discipline returns to the sector.

That is the less comfortable reading of this week's news. The teams building thoughtfully on inference economics now are not just saving money today. They are building something that will still work when the subsidized pricing era ends.