Cost Control for LLM Agents, Token Budgets and Anthropic Prompt Caching
TL;DR — Three levers. Hard token budgets per session, model routing by step type, and prompt caching for stable prefixes. Anthropic’s new caching API can cut input costs by up to 90% on repeated prefixes.
A multi-step agent on GPT-4-turbo costs an order of magnitude more than a single chatbot call. Most teams discover this when the monthly bill arrives and the founder asks what happened. The answer is usually “we have a feedback loop with no cap and a system prompt with five tools whose descriptions total 1800 tokens.” Every iteration of the loop replays the entire conversation including those 1800 tokens. Twelve iterations, twelve replays. The math is brutal.
This post is the cost control toolkit I apply to every agent before it sees production traffic. Hard budgets, model routing, prompt caching, and the measurement discipline that makes any of those choices defensible. There’s no silver bullet here, but the combination reliably cuts agent costs by 60-80% versus the naive implementation.
Cost in May 2024 has one new lever worth highlighting up front. Anthropic shipped prompt caching in beta in April, and it’s a meaningful change. For agents with large stable system prompts (tools, instructions, examples), cached prefixes are charged at 10% of the input rate after the first request. That’s not a marginal optimization; it’s a different cost profile.
Measure before you cut
You can’t optimize what you don’t measure. The first thing I do on a new agent codebase is wire token accounting into the request path. Every LLM call logs prompt tokens, completion tokens, model, and a step label.
import time
from openai import OpenAI
client = OpenAI()
def call_with_metering(messages, model, step_label, **kwargs):
t0 = time.time()
resp = client.chat.completions.create(model=model, messages=messages, **kwargs)
log_event({
"step": step_label,
"model": model,
"prompt_tokens": resp.usage.prompt_tokens,
"completion_tokens": resp.usage.completion_tokens,
"latency_ms": int((time.time() - t0) * 1000),
"session_id": current_session_id(),
})
return resp
Query this in a notebook a week later and you’ll find one or two step labels accounting for half your spend. Those are where optimization has leverage. Everywhere else, leave it alone.
The OpenAI dashboard’s per-project tagging via the user parameter is a cheaper alternative if you don’t want to roll your own logging. Pass user=session_id and you get usage aggregation per session on the dashboard. Not as flexible as your own logs, but enough for a starting view.
Hard budgets per session
Every agent loop needs a hard token cap. Not a step cap, a token cap. A loop that runs four times but burns 80k tokens per iteration costs more than one that runs ten times with 8k each. Track the cumulative cost as state.
class TokenBudget:
def __init__(self, max_tokens: int):
self.max_tokens = max_tokens
self.consumed = 0
def charge(self, prompt_tokens: int, completion_tokens: int) -> bool:
self.consumed += prompt_tokens + completion_tokens
return self.consumed <= self.max_tokens
def remaining(self) -> int:
return max(0, self.max_tokens - self.consumed)
Wire that into your agent loop. When the budget is exhausted, return whatever partial result you have or trigger a fallback. A graceful “I wasn’t able to fully resolve this, here’s what I found” is better than burning through three times the budget chasing a corner case.
For multi-tenant systems, hard budgets per user per day are also worth setting. The 99th percentile user can cost as much as the median 100 users combined. Catch the runaway before it shows up in next month’s bill.
Model routing
The instinct to use the strongest model everywhere is expensive and usually unjustified. Different steps in an agent loop have different difficulty levels, and matching the model to the task is one of the highest-leverage decisions you can make.
Planning, summarization, extraction, and routing decisions are usually solvable by claude-3-haiku or gpt-3.5-turbo. The hard reasoning steps benefit from claude-3-opus or gpt-4-turbo. The exact split depends on your domain; the principle is universal.
ROUTES = {
"classify_intent": "gpt-3.5-turbo",
"extract_entities": "gpt-3.5-turbo",
"summarize_session": "claude-3-haiku-20240307",
"plan_steps": "gpt-4-turbo",
"execute_step": "gpt-4-turbo",
"critique": "claude-3-sonnet-20240229",
}
def route(step_label: str) -> str:
return ROUTES.get(step_label, "gpt-4-turbo")
Two cautions. First, measure quality on real tasks before deploying a cheaper model. A “we’ll save 80% by switching to haiku” decision that costs you a 10% accuracy drop on the highest-volume step is a net loss. Second, cheaper models are more sensitive to prompt engineering. The prompts that work for gpt-4-turbo will often fail on gpt-3.5-turbo and need rewriting. Budget time for that.
Anthropic prompt caching
This one earns its own section. Anthropic’s prompt caching beta lets you mark portions of your prompt as cacheable. The first request that builds the cache costs 25% more than baseline input pricing. Subsequent requests within the 5-minute TTL that hit the cache cost 10% of baseline input pricing on those tokens.
For an agent with a 3000-token system prompt (tools, instructions, few-shot examples), that’s transformational. Twelve loop iterations would have cost 12 * 3000 = 36,000 input tokens on the system prompt alone. With caching, you pay 3000 * 1.25 once and 3000 * 0.1 eleven more times. That’s 3750 + 3300 = 7050 tokens instead of 36000. An 80% cut on system prompt cost.
from anthropic import Anthropic
client = Anthropic()
TOOLS = [...] # your tool definitions
SYSTEM = "You are a careful customer support agent..." * 100 # large stable prompt
def call_with_cache(user_message: str):
return client.beta.prompt_caching.messages.create(
model="claude-3-sonnet-20240229",
max_tokens=1024,
system=[
{
"type": "text",
"text": SYSTEM,
"cache_control": {"type": "ephemeral"},
}
],
tools=TOOLS,
messages=[{"role": "user", "content": user_message}],
)
The constraints that matter. Caching is keyed on the exact bytes of the cached block, so anything dynamic in your system prompt (current date, user ID, etc.) goes outside the cached block or you’ll never get a cache hit. The TTL is 5 minutes from last use, which is long enough for an active session but not for cross-session reuse. The minimum cacheable size is 1024 tokens for Sonnet and Opus, 2048 for Haiku, so tiny system prompts don’t qualify.
The response includes cache hit metrics in the usage block. Log them. A cache hit rate below 80% on a long-running session means something dynamic crept into the cached block.
Putting it together
The combined effect of these levers on a typical customer support agent.
Baseline: 12-iteration ReAct loop with gpt-4-turbo, 3000-token system prompt, 500-token average user/assistant exchange per turn. Approximate cost per session: 60k input tokens + 12k output tokens = roughly $0.96 at gpt-4-turbo pricing.
After optimization: planner-executor split with gpt-3.5-turbo planner and claude-3-sonnet executor, prompt caching on the stable Claude system prompt, hard 30k token budget. Approximate cost per session: about $0.18. Five-fold reduction without changing the user-visible behavior.
The number isn’t the point. The point is that each of the three levers (budget, routing, caching) contributes meaningfully and they compose. Skip any one and you leave significant savings on the table.
For more on how the loop structure itself affects cost, see /blog/react-reflexion-planner-executor-agent-loops/.
Common Pitfalls
The traps that turn a careful cost strategy into a careless bill.
- Optimizing without metering. Don’t guess where the cost is. Measure, then cut.
- Setting step caps instead of token caps. A pathological step can blow the budget alone. Measure tokens.
- Routing to a cheaper model without quality testing. Cheap and wrong is more expensive than expensive and right.
- Cache busting on dynamic content. Date, user ID, timestamps in the cached block kill your hit rate. Keep them out.
- Forgetting cache writes are more expensive. The first request pays a premium. If sessions are short and don’t reuse the cache, you’re losing money.
- No budget alerts. A single misconfigured agent can spend a month’s budget in hours. Set alerts on your provider dashboard.
- Ignoring streaming cost. Streaming doesn’t change cost per token, but it changes how you cancel mid-generation. Wire up cancellation so a user closing the page actually stops the call.
- Not benchmarking before/after. Without numbers, you can’t tell if your optimizations work or if you just shifted the cost elsewhere.
Wrapping Up
Agent cost is a manageable engineering problem, not a fact of life. The three levers (budgets, routing, caching) are all available, all cheap to implement, and all underused. Most teams I’ve audited have one of the three in place and could cut costs another 40-60% by adding the other two.
The piece I want to emphasize is the measurement discipline. Without per-step token logs, every cost decision is speculation. With them, you can see exactly where the spend goes and target your effort. Spend a day wiring up the metering before you spend a week optimizing.
Anthropic’s prompt caching is the single biggest thing to land in cost control this quarter, and it doesn’t require any architectural changes to take advantage of. If your agent has a stable system prompt and isn’t using caching, you’re paying full price for tokens you could be getting at 10%. That’s the easiest cost win available, and it should be the first thing you implement after the metering.