LLM Cost Control and Token Budgets
TL;DR — LLM costs scale per-token. Without controls, one bug = $10K bill. Discipline: per-request token caps, prompt-compression, response caching by input hash, per-user/per-tenant budgets, daily alerts on spend anomaly. Use cheapest model that works.
The last post covered streaming; this one covers the operational side. LLM integration without cost controls is how teams discover $30K monthly bills nobody approved.
The math you keep doing
text-davinci-003 pricing (Jan 2023): $0.02 per 1K tokens (input + output combined).
- 500-token classify call: $0.01
- 2000-token summarization: $0.04
- 100K calls/day of classification: $1000/day = $30K/month
- 10K daily summaries: $400/day = $12K/month
Manageable until a bug:
- Infinite loop calls the API 1000× → $200 in minutes
- User-controlled prompt size unbounded → 4096-token prompts at $0.08 each
- New “AI feature” launched without rate limiting → 10× expected traffic
Real bills from real teams in 2023. Prevent at the engineering layer.
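The arithmetic above is simple enough to keep in a small helper, so you can re-run it whenever volumes or prices change. This is a minimal sketch: the function names are hypothetical, and the price constant is the text-davinci-003 rate quoted above.

```python
# Hypothetical helper names; the price is the text-davinci-003 rate quoted above.
PRICE_PER_1K_TOKENS = 0.02  # USD, input + output combined


def call_cost(tokens: int, price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """Cost in USD of one call that uses `tokens` tokens in total."""
    return tokens * price_per_1k / 1000


def monthly_cost(tokens_per_call: int, calls_per_day: int, days: int = 30) -> float:
    """Projected spend in USD over `days` days at a steady call rate."""
    return call_cost(tokens_per_call) * calls_per_day * days


if __name__ == "__main__":
    print(f"classification: ${monthly_cost(500, 100_000):,.0f}/month")
    print(f"summaries:      ${monthly_cost(2000, 10_000):,.0f}/month")
```

Plugging in a worst-case token count (for example, every call at the context limit) gives a quick upper bound for a new feature before it ships.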
Per-request budgets
Every call has explicit max_tokens:
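import openai

def complete(prompt: str, max_response_tokens: int = 300) -> str:
    # Hard cap on output length (and therefore output cost)
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=max_response_tokens,
    )
    # Return only the generated text so callers can cache and tokenize it
    return response["choices"][0]["text"]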
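max_tokens caps the response. Set it explicitly rather than relying on the API default (16 tokens on the legacy completions endpoint, no cap short of the context window on chat endpoints). Prompt and completion share text-davinci-003's context window of about 4K tokens, so a call that fills it costs about $0.08.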
Also cap input:
import tiktoken

enc = tiktoken.encoding_for_model("text-davinci-003")
MAX_PROMPT_TOKENS = 1500

def safe_complete(prompt: str) -> str:
    # Count tokens locally before paying for the call
    tokens = enc.encode(prompt)
    if len(tokens) > MAX_PROMPT_TOKENS:
        raise ValueError(f"prompt too long: {len(tokens)} > {MAX_PROMPT_TOKENS}")
    return complete(prompt)
User-controlled input feeding into a prompt? Validate length explicitly.
Cache by input hash
Same prompt → same response (near enough, with temperature=0; set it explicitly, since the API default is 1). Cache:
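import hashlib
import redis

r = redis.Redis()

def cached_complete(prompt: str, max_tokens: int = 300) -> str:
    # Key on everything that changes the output, not just the prompt
    digest = hashlib.sha256(f"{max_tokens}:{prompt}".encode()).hexdigest()
    key = f"llm:{digest}"
    cached = r.get(key)
    if cached is not None:  # an empty completion is still a valid cache hit
        return cached.decode()
    result = complete(prompt, max_tokens)
    r.setex(key, 86400 * 30, result)  # 30 days
    return result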
Hit rate depends on how repetitive your prompts are. For classification with similar tickets, a 30-50% hit rate is realistic, which cuts cost by roughly the same share.
Watch your hit rate: long prompts rarely repeat exactly, so nearly every one hashes to a new key. Caching works best for short, recurring prompts.
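Whether the cache is paying for itself is easiest to judge from a measured hit rate. The sketch below is an in-memory stand-in for the Redis cache that counts hits and misses. `CountingCache` and its `backend` argument are hypothetical names, and a production version would keep the counters in your metrics system.

```python
import hashlib
from typing import Callable, Dict


class CountingCache:
    """In-memory stand-in for the Redis cache that also counts hits and misses."""

    def __init__(self, backend: Callable[[str], str]):
        self._backend = backend  # the function that actually calls the LLM
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._backend(prompt)
        self._store[key] = result
        return result

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


if __name__ == "__main__":
    cache = CountingCache(backend=lambda prompt: prompt.upper())  # fake LLM
    for prompt in ["refund?", "login issue", "refund?", "refund?"]:
        cache.complete(prompt)
    print(f"hit rate: {cache.hit_rate:.0%}")
```

Replaying a day of real prompts through something like this, with a fake backend, gives a hit-rate estimate before you pay for a single call.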
Prompt compression
Shorter prompts = lower cost. Compress without losing meaning:
- Strip non-essential phrases. “Please carefully analyze…” → “Analyze…”
- Use abbreviations the model knows. “category” → “cat” if the schema’s known. Check with the tokenizer first: many common words are already a single token, so the abbreviation may save nothing.
- Remove repeated context. If you give 5 examples, examples 4-5 may not help.
- Truncate user input. If summarizing emails, cut quoted-reply chains.
Be careful: compression at the cost of accuracy is a false economy. Measure both.
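Cutting quoted-reply chains is the easiest of these to automate. The following is a rough heuristic sketch, not a mail parser: it assumes replies are quoted with a leading ">" and introduced by an "On ... wrote:" line, which many but not all mail clients do. The function name is hypothetical.

```python
import re

# Heuristic pattern; real mail clients vary, so treat it as an assumption.
_REPLY_HEADER = re.compile(r"^On .+ wrote:\s*$")


def strip_quoted_replies(body: str) -> str:
    """Drop '>'-quoted lines and everything after an 'On ... wrote:' header."""
    kept = []
    for line in body.splitlines():
        if _REPLY_HEADER.match(line):
            break  # everything below is the quoted thread
        if line.lstrip().startswith(">"):
            continue  # inline quoted line
        kept.append(line)
    return "\n".join(kept).strip()


if __name__ == "__main__":
    email = (
        "Thanks, approved.\n"
        "\n"
        "On Mon, Jan 9, 2023, Ana wrote:\n"
        "> Can you approve the invoice?\n"
        "> Thanks"
    )
    print(strip_quoted_replies(email))
```

Run it before the token-count check so the cap applies to the text you actually send. As with any compression step, compare summary quality with and without it.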
Cheaper-model fallback
Not every task needs the biggest model:
- Classification: text-curie-001 ($0.002/1K) often as good as text-davinci-003 ($0.02/1K). 10× cheaper.
- Simple extraction: text-babbage-001 ($0.0005/1K).
- Generation / creative: keep text-davinci-003.
Test cheaper models on your eval set; deploy the cheapest one that clears your accuracy bar.
(Note: by March 2023, gpt-3.5-turbo arrives at $0.002/1K with quality similar to text-davinci-003. Migrate then.)
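Picking "the cheapest model that clears the bar" can be a small function over your eval results. In this sketch the prices are the ones quoted above, but the accuracy figures are placeholders made up for illustration; substitute numbers measured on your own eval set. `cheapest_passing` is a hypothetical name.

```python
from typing import Dict, Optional, Tuple

# model -> (USD per 1K tokens, accuracy on your eval set).
# Prices are quoted in the text above; the accuracies are PLACEHOLDERS, not measurements.
CANDIDATES: Dict[str, Tuple[float, float]] = {
    "text-babbage-001": (0.0005, 0.81),
    "text-curie-001": (0.002, 0.93),
    "text-davinci-003": (0.02, 0.95),
}


def cheapest_passing(
    candidates: Dict[str, Tuple[float, float]], accuracy_bar: float
) -> Optional[str]:
    """Return the lowest-priced model whose accuracy meets the bar, or None."""
    passing = [
        (price, name)
        for name, (price, accuracy) in candidates.items()
        if accuracy >= accuracy_bar
    ]
    return min(passing)[1] if passing else None


if __name__ == "__main__":
    print(cheapest_passing(CANDIDATES, accuracy_bar=0.90))
```

Returning None when nothing passes forces an explicit decision (lower the bar, improve the prompt, or pay for a bigger model) instead of a silent fallback.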
Per-user / per-tenant rate limits
For multi-tenant SaaS:
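USER_DAILY_BUDGET = 10.00  # USD; pick per tier
class BudgetExceeded(Exception):
    pass

def check_user_budget(user_id: str, cost_estimate: float):
    key = f"llm_spend:user:{user_id}:daily"
    # Reserve first, then check: a get-then-incr pair races under concurrency
    spent = r.incrbyfloat(key, cost_estimate)
    # Set the TTL once; refreshing it on every call would stop the counter from ever resetting
    if r.ttl(key) == -1:
        r.expire(key, 86400)
    if spent > USER_DAILY_BUDGET:
        r.incrbyfloat(key, -cost_estimate)  # undo the reservation
        raise BudgetExceeded()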
Per-user daily cap. Without it, one user’s app can rack up the whole month’s budget in a day.
For free-tier users: $0.50/day. Paid: $10/day. Enterprise: negotiated.
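The tier caps can live in one lookup table so the budget check never hard-codes a number. This is a minimal sketch using the figures above. The tier names, the `negotiated` parameter and both function names are assumptions made for illustration.

```python
from typing import Optional

# Daily caps in USD from the tiers above; None means "negotiated per contract".
TIER_DAILY_BUDGET_USD = {"free": 0.50, "paid": 10.00, "enterprise": None}


def daily_budget(tier: str, negotiated: Optional[float] = None) -> float:
    """Resolve the daily cap for a tier; enterprise must supply its own figure."""
    if tier not in TIER_DAILY_BUDGET_USD:
        raise ValueError(f"unknown tier: {tier!r}")
    budget = TIER_DAILY_BUDGET_USD[tier]
    if budget is None:
        if negotiated is None:
            raise ValueError("enterprise tier needs a negotiated budget")
        return negotiated
    return budget


def within_budget(
    spent_today: float,
    cost_estimate: float,
    tier: str,
    negotiated: Optional[float] = None,
) -> bool:
    """True if the next call would keep the user at or under the daily cap."""
    return spent_today + cost_estimate <= daily_budget(tier, negotiated)
```

Raising on an unknown tier is deliberate: a typo in a plan name should fail loudly rather than grant an unlimited budget.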
Daily spend alerts
In Prometheus + Alertmanager:
- alert: LLMSpendSpike
  # rate() is tokens/sec; * 3600 gives tokens/hour; / 1000 * 0.02 gives USD/hour
  expr: |
    sum(rate(llm_tokens_total[1h])) * 3600 / 1000 * 0.02 > 50
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "LLM hourly spend > $50"
$50/hr × 24h = $1,200/day. Set thresholds at your normal baseline plus 50% headroom.
The OpenAI usage dashboard also shows spend, but with a delay; don’t rely on it for real-time alerting.
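"Baseline plus 50% headroom" can be computed from recent history rather than guessed. The sketch below uses the median of recent hourly spend as the baseline. That choice is an assumption (it resists one-off spikes), and the function names are hypothetical.

```python
from statistics import median
from typing import Sequence


def alert_threshold(hourly_spend_usd: Sequence[float], headroom: float = 0.5) -> float:
    """Baseline (median of recent hourly spend) plus fractional headroom."""
    if not hourly_spend_usd:
        raise ValueError("need at least one sample")
    return median(hourly_spend_usd) * (1 + headroom)


def is_spike(current_hour_usd: float, history: Sequence[float], headroom: float = 0.5) -> bool:
    """True if the current hour's spend exceeds the threshold derived from history."""
    return current_hour_usd > alert_threshold(history, headroom)


if __name__ == "__main__":
    history = [30, 32, 34, 36, 100]  # one past spike barely moves the median
    print(alert_threshold(history))
    print(is_spike(60, history))
```

Recompute the threshold periodically and copy it into the alert rule; a threshold set once at launch drifts out of date as traffic grows.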
Logging cost per request
import logging

def complete_observed(prompt: str, **kwargs) -> str:
    tokens_in = len(enc.encode(prompt))
    # complete() names its cap max_response_tokens, so read the same key
    max_out = kwargs.get("max_response_tokens", 300)
    # Worst case, known before the call: usable for budget checks
    cost_estimate = (tokens_in + max_out) * 0.02 / 1000
    result = complete(prompt, **kwargs)
    tokens_out = len(enc.encode(result))
    cost_actual = (tokens_in + tokens_out) * 0.02 / 1000
    logging.info("llm_call", extra={
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "cost_estimate_usd": cost_estimate,
        "cost_usd": cost_actual,
        "model": "text-davinci-003",
    })
    return result
Ship these records to your observability stack. Add user and endpoint fields to each record, then aggregate daily and pivot by user, endpoint, and model.
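If your log pipeline can hand back the records as dictionaries, the daily pivot is a few lines. This sketch assumes each record carries `cost_usd` plus whatever dimension you pivot on. The `user` field here is an assumption, since the logging call above does not emit one unless you add it.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def spend_by(records: Iterable[Mapping], field: str) -> Dict[str, float]:
    """Sum cost_usd per distinct value of `field`; missing values go to 'unknown'."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.get(field, "unknown")] += record["cost_usd"]
    return dict(totals)


if __name__ == "__main__":
    records = [
        {"user": "u1", "model": "text-davinci-003", "cost_usd": 0.01},
        {"user": "u2", "model": "text-curie-001", "cost_usd": 0.001},
        {"user": "u1", "model": "text-davinci-003", "cost_usd": 0.04},
    ]
    # Biggest spenders first
    for user, usd in sorted(spend_by(records, "user").items(), key=lambda kv: -kv[1]):
        print(f"{user}: ${usd:.3f}")
```

The same function pivots by model or endpoint by changing the `field` argument.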
The kill switch
Have one:
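import os

LLM_DISABLED = os.environ.get("LLM_DISABLED", "false").lower() == "true"

class ServiceUnavailable(Exception):
    pass

def complete(prompt: str, **kwargs) -> str:
    if LLM_DISABLED:
        raise ServiceUnavailable("LLM temporarily disabled")
    # _complete is the real API wrapper (the earlier complete(), renamed)
    return _complete(prompt, **kwargs)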
Set the LLM_DISABLED=true env var and deploy; every LLM call then fails fast. The flag is read at import time, so it takes effect on restart. This has saved at least one team from a runaway loop at midnight.
Common Pitfalls
No max_tokens. Output length is left to API defaults instead of your budget; a call that fills the context window costs $0.08.
No input validation. User submits 10MB string; tokenized; thousands of tokens; high cost.
Cache by raw prompt without hash. Long prompts as Redis keys = Redis memory pressure.
No per-tenant budgets. One bad customer drains the whole budget.
No spend alerts. First notice is the invoice.
Always-on premium model. Where cheaper works, use it.
Caching responses with PII without thought. Cache hits across users = data leak.
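One way to avoid the cross-user leak in the last pitfall is to scope every cache key to the tenant, so a hit can only come from that tenant's own earlier call. This is a sketch under that assumption. `cache_key` and its parameters are hypothetical names, and it trades away cross-tenant hit rate for isolation.

```python
import hashlib


def cache_key(tenant_id: str, model: str, max_tokens: int, prompt: str) -> str:
    """Build a fixed-length Redis key scoped to one tenant."""
    h = hashlib.sha256()
    for part in (model, str(max_tokens), prompt):
        data = part.encode()
        # Length prefix keeps adjacent parts from running together
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    # Tenant stays outside the hash so one tenant's keys can be listed or purged by prefix
    return f"llm:{tenant_id}:{h.hexdigest()}"


if __name__ == "__main__":
    a = cache_key("acme", "text-davinci-003", 300, "Summarize: ...")
    b = cache_key("globex", "text-davinci-003", 300, "Summarize: ...")
    print(a != b)  # same prompt, different tenants, different keys
```

Hashing also addresses the raw-prompt pitfall: every key has the same short length regardless of prompt size.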
Wrapping Up
Cost discipline = max_tokens + input caps + cache + cheap model + per-user budget + alerts + kill switch. Friday: error handling + retries.