AI bills tripled while token prices fell: the agent-loop math

Token prices fell 98% since 2022. Enterprise AI budgets tripled. The agent-loop math explains the gap and how to model next quarter's budget.

The Editors · 6 min read · Jun 6, 2026

Token prices fell 98%. Enterprise AI bills tripled. The reason isn't pricing.

Three numbers from the last month explain it.

The cost of GPT-4-equivalent intelligence has fallen from roughly $20 per million tokens in late 2022 to about $0.40 today. Average enterprise AI spend over the same window went from $1.2 million per year in 2024 to $7 million in 2026. Anthropic agreed in May to pay $1.25 billion per month for Colossus 1 compute through May 2029, a roughly $45 billion contract over its term.

Two of those curves point in opposite directions on purpose. Per-token costs follow the trajectory of any maturing commodity: down. Per-useful-task costs follow consumption. And consumption climbs faster than price drops.

If you're budgeting AI work for the next quarter or pricing a product on top of it, model the agent-loop math. Headline rate cuts will mislead you.

Why falling token prices push bills up

A single API call to a 2026 frontier model costs less than the same call did in 2023. That part is real, and pricing dashboards confirm it. The trouble is that almost nobody is making the same call. They're making agentic calls.

An agent loop re-sends the full conversation history on every turn so the model can reason over prior steps. A ten-turn session costs closer to 50x a single call, not 10x, because cumulative context dominates the bill. The same source puts a simple chatbot interaction at about $0.001 and a complex coding task at $5 to $8. Same model family, four orders of magnitude apart, because the task shape changed.

Microsoft confirmed the pattern with its wallet. Fortune reported that Microsoft revoked direct Claude Code licenses after six months when individual engineers were spending $500 to $2,000 a month on tokens. Nvidia VP Bryan Catanzaro told the same outlet that for his team, the cost of compute is far beyond the cost of the employees.

Uber's CTO Praveen Neppalli Naga told Fortune the company burned its entire 2026 AI coding budget in four months. That is what consumption-creep looks like under an internal incentive to ship faster.

The hyperscaler tell

If you want the cleanest signal that token revenue is being subsidized by compute spend, read the hyperscaler contracts. Anthropic's revenue run-rate hit roughly $47 billion in May 2026. The same company committed to $1.25 billion per month for Colossus 1 compute through May 2029, about $15 billion per year to a single counterparty. The deal has a 90-day termination clause, which is the only thing keeping it from being existential. At that scale the compute bill is the business.

On the demand side, frontier model price compression is accelerating. Alibaba's Qwen 3.7 Max, released May 21, prices between $1.25 and $2.50 per million input tokens and $3.75 to $7.50 output, depending on provider. That is roughly a third of what Claude Opus 4.7 charges for comparable agentic work.

Cheaper frontier models accelerate consumption. Lower per-token prices let teams run larger prompts, more retries, deeper retrieval, and longer agent loops, which is exactly where consumption-creep happens.

What to do if you're paying the bill

Five moves, ordered by leverage.

Instrument cost per task, not per token. Token meters hide your real unit economics. Tag spend by feature. If a feature costs $0.04 per use in a linear flow and over a dollar once you wrap it in an agent loop, you need that ratio in the dashboard before the next planning cycle.
Cap turns. Every agent loop needs a hard ceiling on iterations and a soft target. Five turns covers most production cases. Anything over ten is a debugging signal, not a feature.
Route by step difficulty. Frontier models for hard subtasks, cheap models for routing and parsing. A fast Sonnet or a Qwen 3.7 Max for the planner; bigger model only when called. Most agent frameworks support this now and most teams haven't switched.
Cache aggressively, then cache again. Prompt caching is the only structural discount in 2026. Implicit and explicit caches at the provider layer cut repeated-context bills by 70 to 90 percent. Most teams turn it on once and never tune it.
Treat AI cost like cloud cost. Tag, alert, budget by feature. The companies whose 2026 bills tripled are the same ones that did not have FinOps for AI yet.

The caveat

This is an agentic-workload story. If your use case is single-turn and short-prompt, like a transcript summary, a customer-support handoff, a clean classification, falling token prices help you the way pricing dashboards say they do. The cost paradox is a structural feature of long-context, multi-turn, tool-using systems. The further your product moves toward agents, the harder the pricing dashboard's good news will hurt your P&L.

What to watch

The Linux Foundation's Tokenomics Foundation launches formally in July with the stated aim of producing canonical metrics like cost-per-intelligence and tokens-per-watt. Standards arrive when an industry can no longer explain its bills to its own customers. That is roughly where AI sits now.

The second signal is what happens to model labs when frontier price compression closes the margin on inference. Anthropic's IPO prospectus, filed confidentially with the SEC on June 1, will put the SpaceX compute line in front of public-market investors who price it as a fixed cost, not a growth story. That repricing is the next thing worth reading.