LLM Cost Control and Token Budgets

January 24, 2023 · 4 min read · by Muhammad Amal

TL;DR — LLM costs scale per-token. Without controls, one bug = $10K bill. Discipline: per-request token caps, prompt-compression, response caching by input hash, per-user/per-tenant budgets, daily alerts on spend anomaly. Use cheapest model that works.

After streaming, the operational side. Integrating an LLM without cost controls is how teams discover $30K monthly bills nobody approved.

The math you keep doing

text-davinci-003 pricing (Jan 2023): $0.02 per 1K tokens (input + output combined).

500-token classify call: $0.01
2000-token summarization: $0.04
100K calls/day of classification: $1000/day = $30K/month
10K daily summaries: $400/day = $12K/month
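The arithmetic above is worth keeping as a small helper so the estimate can be re-run whenever pricing or traffic changes. A minimal sketch; the function names and the flat 30-day month are assumptions made for illustration, not part of any SDK:

```python
PRICE_PER_1K_TOKENS = 0.02  # USD, text-davinci-003 (Jan 2023), input + output combined


def cost_per_call(tokens: int, price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """Cost in USD of one call that consumes `tokens` tokens in total."""
    return tokens * price_per_1k / 1000


def monthly_cost(tokens: int, calls_per_day: int, days: int = 30) -> float:
    """Projected monthly spend in USD, assuming flat daily traffic."""
    return cost_per_call(tokens) * calls_per_day * days


if __name__ == "__main__":
    print(f"classification: ${monthly_cost(500, 100_000):,.0f}/month")
    print(f"summarization:  ${monthly_cost(2000, 10_000):,.0f}/month")
```

Swapping in a different price per 1K tokens shows immediately how much a cheaper model would save at the same volume.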

Manageable until a bug:

Infinite loop calls the API 1000× → $200 in minutes
User-controlled prompt size unbounded → 4096-token prompts at $0.08 each
New “AI feature” launched without rate limiting → 10× expected traffic

Real bills from real teams in 2023. Prevent at the engineering layer.
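One engineering-layer guard against both the runaway loop and the unthrottled launch is a hard ceiling on calls per time window, checked before every request. The sketch below is an in-process sliding-window limiter; `CallLimiter` and its parameters are hypothetical names, and a multi-instance service would need a shared store instead of local memory:

```python
import time
from collections import deque


class CallLimiter:
    """Allow at most `max_calls` within any rolling `window_seconds` window."""

    def __init__(self, max_calls: int, window_seconds: float, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so tests can control time
        self._calls = deque()  # timestamps of recently allowed calls

    def allow(self) -> bool:
        now = self._clock()
        # Drop timestamps that have aged out of the window
        while self._calls and now - self._calls[0] >= self.window_seconds:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            return False
        self._calls.append(now)
        return True
```

Call `allow()` before each API request and fail fast when it returns False; a loop bug then costs at most `max_calls` requests per window rather than an open-ended amount.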

Per-request budgets

Every call has explicit max_tokens:

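import openai

def complete(prompt: str, max_response_tokens: int = 300) -> str:
    # Hard cap on output tokens. The legacy Completion API returns a response
    # object, so extract the generated text for callers.
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=max_response_tokens,
    )
    return response["choices"][0]["text"]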

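max_tokens caps the response. Set it per task rather than relying on the default (16 tokens on the Completions endpoint, too short for most uses) or on a blanket high value: a cap near the context limit (about 4K tokens for text-davinci-003, shared between prompt and completion) allows $0.08 per call.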

Also cap input:

import tiktoken

enc = tiktoken.encoding_for_model("text-davinci-003")
MAX_PROMPT_TOKENS = 1500

def safe_complete(prompt: str):
    tokens = enc.encode(prompt)
    if len(tokens) > MAX_PROMPT_TOKENS:
        raise ValueError(f"prompt too long: {len(tokens)} > {MAX_PROMPT_TOKENS}")
    return complete(prompt)

User-controlled input feeding into a prompt? Validate length explicitly.
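Rejecting an oversized prompt is the strict option; for inputs such as documents to summarize, truncating to the token budget may be acceptable instead. A hedged sketch of that alternative, with the tokenizer passed in as two callables so the helper does not depend on any particular library (`truncate_to_tokens` is a hypothetical name):

```python
from typing import Callable, List


def truncate_to_tokens(
    text: str,
    max_tokens: int,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> str:
    """Return `text` unchanged if it fits, else only its first `max_tokens` tokens."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text
    return decode(tokens[:max_tokens])
```

With the tiktoken encoder from the snippet above, the call would be `truncate_to_tokens(prompt, MAX_PROMPT_TOKENS, enc.encode, enc.decode)`. Truncation silently drops content, so it suits free-text payloads better than structured instructions.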

Cache by input hash

Same prompt → same response, provided the call uses temperature=0 (the API default is 1, so set it explicitly on cached paths). Cache:

import hashlib
import redis

r = redis.Redis()

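def cached_complete(prompt: str, max_tokens: int = 300) -> str:
    # Key on everything that changes the output, not just the prompt
    digest = hashlib.sha256(f"{max_tokens}:{prompt}".encode()).hexdigest()
    key = f"llm:{digest}"
    cached = r.get(key)
    if cached is not None:  # an empty completion is still a valid hit
        return cached.decode()
    result = complete(prompt, max_response_tokens=max_tokens)
    r.setex(key, 86400 * 30, result)  # 30 days
    return result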

Hit rate depends on how repetitive your prompts are. For classification with similar tickets, 30-50% cache hit is realistic = 30-50% cost reduction.

Watch cache keys: long prompts produce many unique hashes. Cache strategy works best for short, recurring prompts.
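The cache key deserves some care: it should cover every input that changes the output, and it needs a scope when responses may contain user-specific data. The sketch below shows one way to build such a key, plus a tiny in-memory stand-in for Redis GET/SETEX so the idea can be tried without a server. `cache_key`, `TTLCache`, and the `scope` parameter are illustrative names, not part of redis-py or the OpenAI SDK:

```python
import hashlib
import json
import time
from typing import Callable, Dict, Optional, Tuple


def cache_key(prompt: str, model: str, max_tokens: int, scope: str = "global") -> str:
    """Hash every input that changes the output; `scope` isolates tenants or users."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "max_tokens": max_tokens, "scope": scope},
        sort_keys=True,
    )
    return "llm:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCache:
    """In-process stand-in for Redis GET/SETEX, for illustration only."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

    def setex(self, key: str, ttl_seconds: float, value: str) -> None:
        self._store[key] = (self._clock() + ttl_seconds, value)
```

Passing a tenant or user ID as `scope` means a response cached for one customer can never be served to another, at the price of a lower hit rate.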

Prompt compression

Shorter prompts = lower cost. Compress without losing meaning:

Strip non-essential phrases. “Please carefully analyze…” → “Analyze…”

Use abbreviations the model knows. “category” → “cat” if the schema’s known.

Remove repeated context. If you give 5 examples, examples 4-5 may not help.

Truncate user input. If summarizing emails, cut quoted-reply chains.

Be careful: compression at the cost of accuracy is a false economy. Measure both.
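Two of the techniques above, stripping filler phrases and cutting quoted-reply chains, are mechanical enough to automate. A rough sketch under stated assumptions: the filler word list and the "On ... wrote:" reply marker are examples only and would need tuning against real prompts and mail clients:

```python
import re

# Hypothetical filler words; extend only after checking accuracy on an eval set
FILLER = ("please", "carefully", "kindly")


def strip_filler(instruction: str) -> str:
    """Remove filler words (case-insensitive) and re-capitalize the first letter."""
    out = instruction
    for word in FILLER:
        out = re.sub(rf"\b{re.escape(word)}\b\s*", "", out, flags=re.IGNORECASE)
    out = out.strip()
    return out[:1].upper() + out[1:]


def drop_quoted_replies(email_body: str) -> str:
    """Keep text up to the first 'On ... wrote:' line and skip '>' quoted lines."""
    kept = []
    for line in email_body.splitlines():
        if re.match(r"^On .+ wrote:\s*$", line):
            break  # everything after this marker is the quoted thread
        if line.lstrip().startswith(">"):
            continue
        kept.append(line)
    return "\n".join(kept).strip()
```

Run both the compressed and uncompressed prompts through the same eval set before adopting either helper; token savings only count if accuracy holds.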

Cheaper-model fallback

Not every task needs the biggest model:

Classification: text-curie-001 ($0.002/1K) often as good as text-davinci-003 ($0.02/1K). 10× cheaper.
Simple extraction: text-babbage-001 ($0.0005/1K).
Generation / creative: keep text-davinci-003.

Test cheaper models on your eval set; deploy whichever passes accuracy bar.
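The selection rule, try models from cheapest to most expensive and keep the first that clears the bar, can be written down directly. In this sketch each model is represented by a plain callable so the logic can be exercised offline; `cheapest_passing` and the predictor mapping are assumptions for illustration, and in practice each callable would wrap a real API call:

```python
from typing import Callable, Dict, List, Optional, Tuple

# (model name, USD per 1K tokens), figures from the list above
CANDIDATES: List[Tuple[str, float]] = [
    ("text-babbage-001", 0.0005),
    ("text-curie-001", 0.002),
    ("text-davinci-003", 0.02),
]


def accuracy(predict: Callable[[str], str], eval_set: List[Tuple[str, str]]) -> float:
    """Fraction of (input, expected) pairs the model labels correctly."""
    correct = sum(1 for text, expected in eval_set if predict(text) == expected)
    return correct / len(eval_set)


def cheapest_passing(
    predictors: Dict[str, Callable[[str], str]],
    eval_set: List[Tuple[str, str]],
    bar: float,
) -> Optional[str]:
    """Return the cheapest candidate whose accuracy meets `bar`, or None."""
    for name, _price in sorted(CANDIDATES, key=lambda c: c[1]):
        if name in predictors and accuracy(predictors[name], eval_set) >= bar:
            return name
    return None
```

A `None` result means no candidate meets the bar, which is a signal to revisit the prompt or the bar rather than to ship the most expensive model by default.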

(Note: in March 2023, gpt-3.5-turbo arrived at $0.002/1K with quality similar to text-davinci-003. Migrate once it passes your eval set.)

Per-user / per-tenant rate limits

For multi-tenant SaaS:

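USER_DAILY_BUDGET = 0.50  # USD per day; tune per tier

class BudgetExceeded(Exception):
    pass

def check_user_budget(user_id: str, cost_estimate: float) -> None:
    key = f"llm_spend:user:{user_id}:daily"
    # Increment atomically first so concurrent requests cannot both pass a stale read
    spent = r.incrbyfloat(key, cost_estimate)
    if r.ttl(key) == -1:
        # Start the 24h window once; resetting it on every call would never let it expire
        r.expire(key, 86400)
    if spent > USER_DAILY_BUDGET:
        r.incrbyfloat(key, -cost_estimate)  # refund the rejected request
        raise BudgetExceeded(f"user {user_id} is over the daily LLM budget")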

Per-user daily cap. Without it, one user’s app can rack up the whole month’s budget in a day.

For free-tier users: $0.50/day. Paid: $10/day. Enterprise: negotiated.
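The tier figures above map naturally onto a lookup table with an optional negotiated override. The sketch below keeps spend in process memory purely so the logic is easy to follow and test; `BudgetTracker`, `try_spend`, and the `override` parameter are hypothetical names, and a real deployment would keep the counters in Redis as in the function above:

```python
from datetime import date
from typing import Callable, Dict, Optional, Tuple

# Daily caps in USD from the tiers above; enterprise is negotiated, so it has no default
TIER_DAILY_BUDGET: Dict[str, float] = {"free": 0.50, "paid": 10.00}


class BudgetTracker:
    """In-memory per-user daily spend tracker (single process only)."""

    def __init__(self, today: Callable[[], date] = date.today):
        self._today = today  # injectable so tests can advance the calendar
        self._spend: Dict[Tuple[str, date], float] = {}

    def try_spend(self, user_id: str, tier: str, cost: float,
                  override: Optional[float] = None) -> bool:
        """Record `cost` and return True if it fits today's cap, else return False."""
        cap = override if override is not None else TIER_DAILY_BUDGET[tier]
        key = (user_id, self._today())  # a new date starts a fresh counter
        spent = self._spend.get(key, 0.0)
        if spent + cost > cap:
            return False
        self._spend[key] = spent + cost
        return True
```

Keying on the calendar date gives a reset at midnight rather than a rolling 24-hour window; which of the two is preferable depends on how the budget is communicated to customers.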

Daily spend alerts

In Prometheus + Alertmanager:

- alert: LLMSpendSpike
  expr: |
    # tokens/sec * 3600 = tokens/hour; / 1000 * 0.02 = USD/hour
    sum(rate(llm_tokens_total[1h])) * 3600 / 1000 * 0.02 > 50
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "LLM hourly spend > $50"

$50/hr × 24h = $1,200/day. Set thresholds based on your normal baseline + 50% headroom.

The OpenAI usage dashboard also shows spend, but with a delay; don’t rely on it for real-time alerting.
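The "baseline + 50% headroom" rule can be computed from recent hourly spend rather than guessed. A small sketch, assuming the baseline is the plain mean of the supplied samples; both function names are made up for illustration, and a median or a same-hour-last-week baseline may suit spiky traffic better:

```python
from statistics import mean
from typing import Sequence


def alert_threshold(hourly_spend_usd: Sequence[float], headroom: float = 0.5) -> float:
    """Baseline (mean hourly spend) plus `headroom`, e.g. 0.5 for +50%."""
    if not hourly_spend_usd:
        raise ValueError("need at least one hourly sample")
    return mean(hourly_spend_usd) * (1 + headroom)


def is_anomalous(current_hour_usd: float, hourly_spend_usd: Sequence[float]) -> bool:
    """True when the current hour exceeds the baseline-derived threshold."""
    return current_hour_usd > alert_threshold(hourly_spend_usd)
```

The resulting number is what would replace the hard-coded 50 in the alert expression.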

Logging cost per request

import logging

def complete_observed(prompt: str, max_response_tokens: int = 300) -> str:
    tokens_in = len(enc.encode(prompt))
    # Worst-case cost, known before the call; suitable for a pre-call budget check
    cost_estimate = (tokens_in + max_response_tokens) * 0.02 / 1000

    result = complete(prompt, max_response_tokens=max_response_tokens)
    tokens_out = len(enc.encode(result))  # result is the completion text
    cost_actual = (tokens_in + tokens_out) * 0.02 / 1000

    logging.info("llm_call", extra={
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "cost_estimate_usd": cost_estimate,
        "cost_usd": cost_actual,
        "model": "text-davinci-003",
    })
    return result

Ships to your observability stack. Aggregate daily; pivot by user, endpoint, model.
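The daily pivot is a group-and-sum over the logged records. A minimal sketch, assuming each record is a dict carrying `cost_usd` plus whatever dimensions were logged; the `user` and `endpoint` fields are assumptions, since the logging call above only records the model:

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping


def spend_by(records: Iterable[Mapping[str, Any]], field: str) -> Dict[str, float]:
    """Sum `cost_usd` per distinct value of `field` (e.g. "user", "model")."""
    totals: Dict[str, float] = defaultdict(float)
    for rec in records:
        # Records missing the dimension are grouped under "unknown"
        totals[str(rec.get(field, "unknown"))] += float(rec["cost_usd"])
    return dict(totals)
```

A large "unknown" bucket is itself useful: it shows which call sites are not yet tagging their requests.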

The kill switch

Have one:

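import os

# Read once at import time, so flipping the flag requires a restart or redeploy
LLM_DISABLED = os.environ.get("LLM_DISABLED", "false").lower() == "true"

class ServiceUnavailable(Exception):
    pass

def complete(prompt: str, **kwargs):
    if LLM_DISABLED:
        raise ServiceUnavailable("LLM temporarily disabled")
    return _complete(prompt, **kwargs)  # _complete: the function that makes the real API call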

Set the LLM_DISABLED=true env var and redeploy (the flag is read once at startup); all LLM calls then fail fast. This saved at least one team from a runaway loop at midnight.
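A variation reads the flag on every call instead of once at import, packaged as a decorator so any LLM-calling function can be guarded the same way. This is a sketch with hypothetical names (`guarded`, `llm_disabled`, `LLMDisabledError`); note that an environment variable still cannot be changed from outside a running process, so switching without a restart would mean replacing the body of `llm_disabled()` with a lookup in a shared store such as Redis:

```python
import functools
import os


class LLMDisabledError(RuntimeError):
    """Raised when the kill switch is on."""


def llm_disabled() -> bool:
    # Evaluated on every call; swap this body for a shared-store lookup if needed
    return os.environ.get("LLM_DISABLED", "false").strip().lower() == "true"


def guarded(call):
    """Wrap an LLM-calling function so it fails fast while the switch is on."""
    @functools.wraps(call)
    def wrapper(*args, **kwargs):
        if llm_disabled():
            raise LLMDisabledError("LLM temporarily disabled")
        return call(*args, **kwargs)
    return wrapper
```

Keeping the check in one function also makes the switch straightforward to exercise in tests, which is worth doing before the night it is needed.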

Common Pitfalls

No explicit max_tokens. An unset or oversized cap lets a single call run to the context limit; $0.08 per request.

No input validation. User submits a 10MB string; it tokenizes to millions of tokens, far past the context limit, and anything that does fit is billed in full.

Cache by raw prompt without hash. Long prompts as Redis keys = Redis memory pressure.

No per-tenant budgets. One bad customer drains the whole budget.

No spend alerts. First notice is the invoice.

Always-on premium model. Where cheaper works, use it.

Caching responses with PII without thought. Cache hits across users = data leak.

Wrapping Up

Cost discipline = max_tokens + input caps + cache + cheap model + per-user budget + alerts + kill switch. Friday: error handling + retries.
