LLM Cost Control and Token Budgets
January 24, 2023 · 4 min read · by Muhammad Amal ai

TL;DR — LLM costs scale per-token. Without controls, one bug = $10K bill. Discipline: per-request token caps, prompt-compression, response caching by input hash, per-user/per-tenant budgets, daily alerts on spend anomaly. Use cheapest model that works.

The last post covered streaming; this one covers the operational side. LLM integration without cost controls is how teams discover $30K monthly bills nobody approved.

The math you keep doing

text-davinci-003 pricing (Jan 2023): $0.02 per 1K tokens (input + output combined).

  • 500-token classify call: $0.01
  • 2000-token summarization: $0.04
  • 100K calls/day of classification: $1000/day = $30K/month
  • 10K daily summaries: $400/day = $12K/month
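The arithmetic above is easy to reproduce in a few lines, which helps when prices or volumes change. A minimal sketch: `call_cost` and `monthly_cost` are illustrative names, and the 30-day month is the same approximation the figures above use.

```python
PRICE_PER_1K_TOKENS = 0.02  # USD, text-davinci-003 (Jan 2023), input + output combined


def call_cost(tokens: int, price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """Cost in USD of one call that consumes `tokens` tokens in total."""
    return tokens * price_per_1k / 1000


def monthly_cost(tokens_per_call: int, calls_per_day: int, days: int = 30) -> float:
    """Projected spend in USD over `days` days at a steady call volume."""
    return call_cost(tokens_per_call) * calls_per_day * days


if __name__ == "__main__":
    print(f"500-token classify call: ${call_cost(500):.2f}")
    print(f"100K classify calls/day: ${monthly_cost(500, 100_000):,.0f}/month")
    print(f"10K summaries/day:       ${monthly_cost(2000, 10_000):,.0f}/month")
```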

Manageable until a bug:

  • Infinite loop calls the API 1000× → $200 in minutes
  • User-controlled prompt size unbounded → 4096-token prompts at $0.08 each
  • New “AI feature” launched without rate limiting → 10× expected traffic

Real bills from real teams in 2023. Prevent at the engineering layer.
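One engineering-layer defence against the runaway-loop case is a ceiling on how fast a single process may call the API. The sketch below is an assumption-laden illustration, not something from a library: `CallRateGuard` and `CallRateExceeded` are hypothetical names, and the guard is process-local, so a fleet of workers would need a shared counter instead.

```python
import time
from collections import deque


class CallRateExceeded(RuntimeError):
    """Raised when calls arrive faster than the configured ceiling."""


class CallRateGuard:
    """Sliding-window limiter: at most `max_calls` per `window_seconds`."""

    def __init__(self, max_calls: int, window_seconds: float, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._clock = clock          # injectable so tests can control time
        self._timestamps = deque()   # times of calls still inside the window

    def check(self) -> None:
        """Record one call, or raise if the window is already full."""
        now = self._clock()
        # Drop timestamps that have aged out of the window
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_calls:
            raise CallRateExceeded(
                f"more than {self.max_calls} LLM calls in {self.window_seconds}s"
            )
        self._timestamps.append(now)
```

Calling `guard.check()` immediately before each API request turns an infinite loop into a fast exception instead of a bill.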

Per-request budgets

Every call has explicit max_tokens:

import openai

def complete(prompt: str, max_response_tokens: int = 300):
    # Hard cap on output; returns the full API response object
    return openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=max_response_tokens,
    )

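max_tokens caps the response, so set it deliberately. The legacy Completions endpoint defaults to 16 tokens, which truncates most answers, while a generous or caller-supplied value lets the model generate up to the context limit (about 4K tokens for text-davinci-003, shared between prompt and completion) = roughly $0.08 per call.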

Also cap input:

import tiktoken

enc = tiktoken.encoding_for_model("text-davinci-003")
MAX_PROMPT_TOKENS = 1500

def safe_complete(prompt: str):
    tokens = enc.encode(prompt)
    if len(tokens) > MAX_PROMPT_TOKENS:
        raise ValueError(f"prompt too long: {len(tokens)} > {MAX_PROMPT_TOKENS}")
    return complete(prompt)

User-controlled input feeding into a prompt? Validate length explicitly.
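Rejecting oversized input is the strict option. Where partial input is acceptable (summarization, for instance), truncating to the token budget is an alternative. A minimal sketch, assuming a hypothetical helper `truncate_to_budget` that takes the tokenizer's encode and decode functions as arguments; with tiktoken those would be `enc.encode` and `enc.decode`.

```python
from typing import Callable, List, Sequence


def truncate_to_budget(
    text: str,
    max_tokens: int,
    encode: Callable[[str], List],
    decode: Callable[[Sequence], str],
) -> str:
    """Return `text` cut down to at most `max_tokens` tokens."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text  # already within budget, leave untouched
    # Keep the head of the input; the tail is dropped
    return decode(tokens[:max_tokens])


if __name__ == "__main__":
    # Whitespace splitting stands in for a real tokenizer in this demo.
    # With tiktoken: truncate_to_budget(prompt, 1500, enc.encode, enc.decode)
    print(truncate_to_budget("one two three four five", 3, str.split, " ".join))
```

Cutting at a token boundary can end mid-sentence, so check on your own data that accuracy holds up.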

Cache by input hash

Same prompt → (nearly) the same response, provided the call uses temperature=0. Cache those calls:

import hashlib
import redis

r = redis.Redis()

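def cached_complete(prompt: str, max_tokens: int = 300) -> str:
    # Hash max_tokens together with the prompt: a different cap can change the output
    digest = hashlib.sha256(f"{max_tokens}:{prompt}".encode()).hexdigest()
    key = f"llm:{digest}"
    cached = r.get(key)
    if cached is not None:  # an empty completion is still a valid hit
        return cached.decode()
    # complete() returns the API response object; cache only the text
    result = complete(prompt, max_tokens).choices[0].text
    r.setex(key, 86400 * 30, result)  # 30 days
    return result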

Hit rate depends on how repetitive your prompts are. For classification with similar tickets, 30-50% cache hit is realistic = 30-50% cost reduction.

Watch cache keys: long prompts are almost always unique, so they rarely hit. Caching works best for short, recurring prompts.
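Hit rate also depends on how the key is built. The sketch below uses a hypothetical `cache_key` helper that folds the model and sampling parameters into the hash and collapses whitespace, so trivially different prompts share an entry. Whitespace normalization is an assumption that only holds when whitespace carries no meaning; skip it for prompts containing code or tables.

```python
import hashlib
import json


def cache_key(prompt: str, model: str, max_tokens: int, temperature: float = 0.0) -> str:
    """Build a fixed-length cache key from everything that affects the output."""
    normalized = " ".join(prompt.split())  # collapse runs of whitespace
    payload = json.dumps(
        {
            "model": model,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "prompt": normalized,
        },
        sort_keys=True,  # stable field order gives a stable hash
    )
    return "llm:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = cache_key("Classify:  refund request\n", "text-davinci-003", 300)
    b = cache_key("Classify: refund request", "text-davinci-003", 300)
    print(a == b)
```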

Prompt compression

Shorter prompts = lower cost. Compress without losing meaning:

Strip non-essential phrases. “Please carefully analyze…” → “Analyze…”

Use abbreviations the model knows. “category” → “cat” if the schema’s known.

Remove repeated context. If you give 5 examples, examples 4-5 may not help.

Truncate user input. If summarizing emails, cut quoted-reply chains.

Be careful: compression at the cost of accuracy is a false economy. Measure both.
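As one concrete case of truncating user input, quoted reply chains can be cut from an email before it reaches the prompt. This is a rough sketch, not a full mail parser: `strip_quoted_replies` is a hypothetical helper that assumes the common "On ... wrote:" header and ">"-prefixed quote lines, and other mail clients format replies differently.

```python
import re

# Matches reply headers such as "On Mon, Jan 2, 2023, Bob wrote:"
_REPLY_HEADER = re.compile(r"^On .+ wrote:\s*$")


def strip_quoted_replies(email_body: str) -> str:
    """Keep only the newest message; drop quoted history."""
    kept = []
    for line in email_body.splitlines():
        if _REPLY_HEADER.match(line):
            break  # everything after the reply header is the quoted chain
        if line.lstrip().startswith(">"):
            continue  # individual quoted line
        kept.append(line)
    return "\n".join(kept).strip()


if __name__ == "__main__":
    body = (
        "Thanks, approved.\n"
        "\n"
        "On Mon, Jan 2, 2023, Bob wrote:\n"
        "> Can you approve the budget?\n"
        "> Thanks\n"
    )
    trimmed = strip_quoted_replies(body)
    print(f"{len(body)} chars -> {len(trimmed)} chars")
```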

Cheaper-model fallback

Not every task needs the biggest model:

  • Classification: text-curie-001 ($0.002/1K) often as good as text-davinci-003 ($0.02/1K). 10× cheaper.
  • Simple extraction: text-babbage-001 ($0.0005/1K).
  • Generation / creative: keep text-davinci-003.

Test cheaper models on your eval set; deploy whichever passes accuracy bar.
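The selection rule can be written down directly: among models that clear the accuracy bar on your eval set, take the cheapest. A minimal sketch using the prices quoted above; `cheapest_passing_model` is an illustrative name, and the accuracy figures in the demo are made up for the example, not measurements.

```python
from typing import Dict, Optional

# USD per 1K tokens, as quoted above (Jan 2023)
PRICE_PER_1K = {
    "text-babbage-001": 0.0005,
    "text-curie-001": 0.002,
    "text-davinci-003": 0.02,
}


def cheapest_passing_model(
    accuracy_by_model: Dict[str, float], min_accuracy: float
) -> Optional[str]:
    """Return the lowest-priced model meeting the bar, or None if none does."""
    passing = [
        model
        for model, accuracy in accuracy_by_model.items()
        if accuracy >= min_accuracy and model in PRICE_PER_1K
    ]
    if not passing:
        return None
    return min(passing, key=lambda model: PRICE_PER_1K[model])


if __name__ == "__main__":
    # Hypothetical eval results, for illustration only
    scores = {
        "text-babbage-001": 0.81,
        "text-curie-001": 0.93,
        "text-davinci-003": 0.95,
    }
    print(cheapest_passing_model(scores, min_accuracy=0.90))
```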

(Note: by March 2023, gpt-3.5-turbo arrives at $0.002/1K with quality similar to text-davinci-003. Migrate then.)

Per-user / per-tenant rate limits

For multi-tenant SaaS:

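from datetime import datetime, timezone

USER_DAILY_BUDGET = 0.50  # USD per day; example free-tier value

class BudgetExceeded(Exception):
    pass

def check_user_budget(user_id: str, cost_estimate: float):
    # Date-stamped key resets at UTC midnight; calling expire() on one fixed key never would
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    key = f"llm_spend:user:{user_id}:{day}"
    current = r.incrbyfloat(key, cost_estimate)  # atomic reserve, then check
    r.expire(key, 2 * 86400)  # cleanup only
    if current > USER_DAILY_BUDGET:
        r.incrbyfloat(key, -cost_estimate)  # roll back the reservation
        raise BudgetExceeded(f"user {user_id} over daily budget")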

Per-user daily cap. Without it, one user’s app can rack up the whole month’s budget in a day.

For free-tier users: $0.50/day. Paid: $10/day. Enterprise: negotiated.
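The tier figures above can live in one lookup so the budget check does not hard-code a single cap. A small sketch under stated assumptions: `daily_budget_for` and the tier names are illustrative, and the enterprise value has to be supplied per contract because it is negotiated.

```python
from typing import Optional

# USD per user per day, from the tiers above
TIER_DAILY_BUDGET_USD = {"free": 0.50, "paid": 10.00}


def daily_budget_for(tier: str, negotiated_usd: Optional[float] = None) -> float:
    """Resolve the daily LLM budget for a user's plan tier."""
    if tier == "enterprise":
        if negotiated_usd is None:
            raise ValueError("enterprise tier needs a negotiated budget")
        return negotiated_usd
    if tier not in TIER_DAILY_BUDGET_USD:
        raise ValueError(f"unknown tier: {tier!r}")
    return TIER_DAILY_BUDGET_USD[tier]


if __name__ == "__main__":
    print(daily_budget_for("free"))
    print(daily_budget_for("enterprise", negotiated_usd=250.0))
```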

Daily spend alerts

In Prometheus + Alertmanager:

- alert: LLMSpendSpike
  expr: |
    sum(rate(llm_tokens_total[1h])) * 3600 / 1000 * 0.02 > 50
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "LLM hourly spend > $50"

$50/hr × 24h = $1200/day. Set thresholds at your normal baseline plus 50% headroom.
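Deriving the threshold from history takes only a few lines. In this sketch the median as the definition of "normal baseline" is an assumption (it resists one-off spikes better than the mean), and both function names are illustrative. The second helper converts a dollar threshold into the token rate it implies, which is a useful sanity check on an alert expression.

```python
import statistics
from typing import Sequence


def alert_threshold(hourly_spend_usd: Sequence[float], headroom: float = 0.5) -> float:
    """Baseline (median hourly spend) plus fractional headroom, in USD/hour."""
    if not hourly_spend_usd:
        raise ValueError("need at least one hour of spend history")
    baseline = statistics.median(hourly_spend_usd)
    return baseline * (1 + headroom)


def usd_per_hour_to_tokens_per_sec(usd_per_hour: float, price_per_1k: float = 0.02) -> float:
    """Token throughput that costs `usd_per_hour` at the given price."""
    tokens_per_hour = usd_per_hour / price_per_1k * 1000
    return tokens_per_hour / 3600


if __name__ == "__main__":
    history = [30, 32, 34, 31, 33]  # hypothetical hourly spend in USD
    print(alert_threshold(history))
    print(round(usd_per_hour_to_tokens_per_sec(50), 1))
```

At $0.02 per 1K tokens, $50/hour works out to roughly 694 tokens per second.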

The OpenAI usage dashboard also shows spend, but with a delay; don’t rely on it for real-time alerting.

Logging cost per request

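import logging

PRICE_PER_1K = 0.02  # USD, text-davinci-003

def complete_observed(prompt: str, max_response_tokens: int = 300):
    # Worst-case cost before the call: prompt plus the full output cap
    tokens_in = len(enc.encode(prompt))
    cost_estimate = (tokens_in + max_response_tokens) * PRICE_PER_1K / 1000

    response = complete(prompt, max_response_tokens)
    # Actual output size, counted from the returned completion text
    tokens_out = len(enc.encode(response.choices[0].text))
    cost_actual = (tokens_in + tokens_out) * PRICE_PER_1K / 1000

    logging.info("llm_call", extra={
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "cost_estimate_usd": cost_estimate,
        "cost_usd": cost_actual,
        "model": "text-davinci-003",
    })
    return response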

Ships to your observability stack. Aggregate daily; pivot by user, endpoint, model.
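If the log records are available as dictionaries, the pivot is a short group-by. A minimal sketch: `spend_by` is an illustrative name, and the `user` and `endpoint` fields are assumptions, since the logging call above only records tokens, cost, and model; you would need to add those fields to `extra` yourself.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def spend_by(records: Iterable[Mapping], field: str) -> Dict[str, float]:
    """Sum `cost_usd` across records, grouped by the value of `field`."""
    totals = defaultdict(float)
    for record in records:
        totals[record[field]] += record["cost_usd"]
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical records for illustration
    records = [
        {"user": "u1", "endpoint": "/classify", "model": "text-curie-001", "cost_usd": 0.01},
        {"user": "u1", "endpoint": "/summarize", "model": "text-davinci-003", "cost_usd": 0.04},
        {"user": "u2", "endpoint": "/classify", "model": "text-curie-001", "cost_usd": 0.02},
    ]
    for field in ("user", "endpoint", "model"):
        print(field, spend_by(records, field))
```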

The kill switch

Have one:

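import os

# Read once at startup, so flipping it needs a restart or redeploy
LLM_DISABLED = os.environ.get("LLM_DISABLED", "false").lower() == "true"

class ServiceUnavailable(Exception):
    pass

def complete(prompt: str, **kwargs):
    # _complete is the real API wrapper (the earlier complete(), renamed)
    if LLM_DISABLED:
        raise ServiceUnavailable("LLM temporarily disabled")
    return _complete(prompt, **kwargs)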

Set LLM_DISABLED=true and redeploy; every LLM call then fails fast. This has saved at least one team from a runaway loop at midnight.

Common Pitfalls

No deliberate max_tokens. A generous or caller-supplied limit lets a single call approach the context limit; about $0.08 per request.

No input validation. User submits 10MB string; tokenized; thousands of tokens; high cost.

Cache by raw prompt without hash. Long prompts as Redis keys = Redis memory pressure.

No per-tenant budgets. One bad customer drains the whole budget.

No spend alerts. First notice is the invoice.

Always-on premium model. Where cheaper works, use it.

Caching responses with PII without thought. Cache hits across users = data leak.
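For the last pitfall, one mitigation is to namespace cache entries by tenant, so a hit can only come from that tenant's own earlier request. A minimal sketch with a hypothetical `tenant_cache_key` helper; it lowers the hit rate for prompts shared across tenants, which is the price of isolation.

```python
import hashlib


def tenant_cache_key(tenant_id: str, prompt: str) -> str:
    """Cache key scoped to one tenant; identical prompts never collide across tenants."""
    # Hash tenant and prompt separately so neither can spoof the other's boundary
    tenant_part = hashlib.sha256(tenant_id.encode("utf-8")).hexdigest()[:16]
    prompt_part = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return f"llm:{tenant_part}:{prompt_part}"


if __name__ == "__main__":
    same_prompt = "Summarize the attached customer record."
    print(tenant_cache_key("acme", same_prompt) == tenant_cache_key("globex", same_prompt))
```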

Wrapping Up

Cost discipline = max_tokens + input caps + cache + cheap model + per-user budget + alerts + kill switch. Friday: error handling + retries.