Error Handling and Retries for LLM APIs
January 27, 2023 · 4 min read · by Muhammad Amal ai

TL;DR: OpenAI has outages. Transient errors (429, 500, 503, timeout) get exponential backoff retries. Permanent errors (400, 401, 403) get no retry. Have a fallback: cached response, rule-based default, or “service degraded” message. Never let LLM downtime take your service down.

After cost control, the reliability layer. OpenAI’s status page in early 2023 has weekly minor incidents and occasional major outages. Your service shouldn’t share its uptime.

OpenAI error types

The openai Python library raises specific exceptions:

Exception                 HTTP          Retry?          Cause
RateLimitError            429           yes (backoff)   exceeded quota
Timeout                   n/a           yes             request took too long
APIConnectionError        n/a           yes             network issue
APIError                  500/502/503   yes             OpenAI server error
ServiceUnavailableError   503           yes             OpenAI overloaded
InvalidRequestError       400           no              bad prompt / params
AuthenticationError       401           no              bad API key
PermissionError           403           no              access denied

Retry on the top five; fail-fast on the bottom three.
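
The same split can be expressed on raw HTTP status codes, which is handy when a call goes through a proxy or a different client. This is a minimal sketch, not part of the openai library: `should_retry` and `TRANSIENT_STATUS` are hypothetical names, a missing status (timeout, connection failure) is assumed to be transient, and any code not listed is assumed to be permanent.

```python
from typing import Optional

# Status codes taken from the table above; the names are hypothetical.
TRANSIENT_STATUS = {429, 500, 502, 503}


def should_retry(status: Optional[int]) -> bool:
    """Return True when a failed call is worth retrying."""
    if status is None:
        # Timeouts and connection errors carry no HTTP status.
        return True
    # 400, 401, 403 and anything unknown fail fast.
    return status in TRANSIENT_STATUS
```

Failing fast on unknown codes is the conservative choice: an unexpected status is more likely a bug on your side than a blip on theirs.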

Retry pattern in Python

import openai
from tenacity import (
    retry, retry_if_exception_type, stop_after_attempt,
    wait_exponential, before_sleep_log
)
import logging

log = logging.getLogger(__name__)

RETRYABLE = (
    openai.error.RateLimitError,
    openai.error.Timeout,
    openai.error.APIConnectionError,
    openai.error.APIError,
    openai.error.ServiceUnavailableError,
)

@retry(
    retry=retry_if_exception_type(RETRYABLE),
    wait=wait_exponential(multiplier=2, min=4, max=60),
    stop=stop_after_attempt(5),
    before_sleep=before_sleep_log(log, logging.WARNING),
    reraise=True,  # surface the original openai error, not tenacity.RetryError
)
def complete_resilient(prompt: str, **kwargs) -> str:
    return openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        request_timeout=30,  # client-side HTTP timeout in openai 0.x; plain timeout= does not bound the request
        **kwargs,
    ).choices[0].text.strip()

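Five attempts means at most four waits between them, starting at 4s and growing exponentially toward the 60s cap. Worst case is up to about a minute of backoff plus up to five 30s request timeouts, so budget roughly three minutes end to end.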
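
To sanity-check a retry budget before deploying it, the schedule can be computed by hand. The sketch below is standalone and does not call tenacity; it assumes the wait before retry n is `multiplier * 2**(n-1)` clamped to the min/max bounds, so check the installed tenacity version for its exact formula. `backoff_schedule` and `worst_case_seconds` are hypothetical helper names.

```python
from typing import List


def backoff_schedule(attempts: int, multiplier: float = 2,
                     min_wait: float = 4, max_wait: float = 60) -> List[float]:
    """Waits between attempts, assuming multiplier * 2**(n-1) clamped to the bounds."""
    waits = []
    for n in range(1, attempts):  # N attempts means N-1 waits
        raw = multiplier * 2 ** (n - 1)
        waits.append(max(min_wait, min(raw, max_wait)))
    return waits


def worst_case_seconds(attempts: int, request_timeout: float) -> float:
    """Every attempt runs into its timeout, plus every wait in between."""
    return attempts * request_timeout + sum(backoff_schedule(attempts))


print(backoff_schedule(5))        # [4, 4, 8, 16]
print(worst_case_seconds(5, 30))  # 182
```

Under this assumed formula the request timeouts, not the backoff waits, dominate the worst case.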

Special handling: rate limits

A 429 response may carry a Retry-After header. When it is present, honor it; otherwise fall back to exponential backoff:

import time

class RetryAfterRetryer:
    def __init__(self, max_attempts=5):
        self.max_attempts = max_attempts

    def call(self, fn, *args, **kwargs):
        for attempt in range(self.max_attempts):
            try:
                return fn(*args, **kwargs)
            except openai.error.RateLimitError as e:
                if attempt == self.max_attempts - 1:
                    raise  # out of attempts; surface the 429 to the caller
                headers = getattr(e, 'headers', None) or {}
                try:
                    wait = float(headers.get('retry-after'))
                except (TypeError, ValueError):
                    wait = 2 ** attempt  # header missing or not a number of seconds
                log.warning(f"rate limited; waiting {wait}s")
                time.sleep(wait)

Hammering after a 429 makes OpenAI throttle harder. Wait as long as the header tells you to.
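
The HTTP specification allows Retry-After to be either a number of seconds or an HTTP date, so a parser that only handles integers can misfire. Here is a minimal sketch of a tolerant parser. `parse_retry_after` is a hypothetical helper, and which form OpenAI actually sends is not something this sketch assumes.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str],
                      now: Optional[datetime] = None) -> Optional[float]:
    """Return seconds to wait, or None if the header is absent or unparseable."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        # Delta-seconds form, e.g. "7".
        return float(value)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now".
    return max(0.0, (when - now).total_seconds())
```

Returning None rather than raising lets the caller fall back to its own exponential backoff when the header is unusable.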

Distinguishing transient from permanent

def safe_complete(prompt: str) -> str:
    # BadPromptError, alert_oncall and fallback_response are app-defined helpers.
    try:
        return complete_resilient(prompt)
    except openai.error.InvalidRequestError as e:
        log.error(f"prompt invalid: {e}")
        # Don't retry. Fix the prompt.
        raise BadPromptError(str(e)) from e
    except openai.error.AuthenticationError:
        log.critical("openai auth failed")
        # Critical: trigger pager
        alert_oncall("openai_auth_failure")
        raise
    except RETRYABLE as e:
        log.error(f"openai transient after retries: {e}")
        # Already retried in complete_resilient; fall back
        return fallback_response(prompt)

Three error categories, three responses:

  • Permanent code bugs → fail fast, log, alert
  • Auth failures → page someone; service is mis-configured
  • Transient → fall back gracefully

Fallback patterns

When the LLM is down, you have three options:

Option A — cached response. If same prompt was answered before, return that:

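import hashlib

def fallback_from_cache(prompt: str) -> str:
    # Built-in hash() is salted per process, so use a stable digest for the key.
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    cached = r.get(f"llm:{key}")  # r: a redis.Redis client
    if cached:
        return cached.decode()
    raise ServiceUnavailable()  # app-defined exception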
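
A cache fallback only helps if successful responses were written under the same key beforehand. The sketch below shows the write side and the read side together, with a plain dict standing in for Redis so it runs on its own. `cache_key`, `complete_and_remember` and `recall` are hypothetical names, and the `llm` argument stands for whatever function actually calls the model.

```python
import hashlib
from typing import Callable, Dict, Optional

_cache: Dict[str, str] = {}  # stand-in for Redis in this sketch


def cache_key(prompt: str) -> str:
    # SHA-256 is stable across processes and restarts, unlike built-in hash().
    return "llm:" + hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def complete_and_remember(prompt: str, llm: Callable[[str], str]) -> str:
    """Call the model and store the answer for use during later outages."""
    answer = llm(prompt)
    _cache[cache_key(prompt)] = answer
    return answer


def recall(prompt: str) -> Optional[str]:
    """Return the last good answer for this exact prompt, if any."""
    return _cache.get(cache_key(prompt))
```

Exact-match keys mean this only pays off for prompts that repeat verbatim, such as templated classification requests.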

Option B — rule-based default. For classification:

def fallback_classify(text: str) -> dict:
    # Crude rules
    text_lower = text.lower()
    if any(w in text_lower for w in ["bill", "charge", "refund", "payment"]):
        return {"category": "billing", "confidence": 0.5}
    if any(w in text_lower for w in ["error", "bug", "broken", "crash"]):
        return {"category": "technical", "confidence": 0.5}
    return {"category": "other", "confidence": 0.3}

The low confidence score flags the result as a guess, so downstream code knows not to trust it blindly.

Option C — degraded service message. For chat features:

"Our AI assistant is temporarily unavailable. Your message has been queued and we'll respond shortly."

Then write the message to a queue and process it when the API is back.

Circuit breaker

For cascading failure prevention:

from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=60, expected_exception=Exception)
def llm_call(prompt: str) -> str:
    return complete_resilient(prompt)

After 5 consecutive failures, circuit opens — subsequent calls fail immediately for 60 seconds. Prevents thundering herd when OpenAI is down.

After 60s, circuit half-opens — one test call; if it succeeds, full operation resumes.
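
To make the closed, open and half-open states concrete, here is a minimal single-threaded sketch of the state machine. It illustrates the idea and is not how the circuitbreaker package is implemented; `SimpleBreaker` and `CircuitOpenError` are hypothetical names, and the injectable `clock` exists only to make the behaviour easy to test.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class SimpleBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 60,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Recovery window elapsed: half-open, let this one call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed probe.
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the count.
        self.failures = 0
        self.opened_at = None
        return result
```

Under concurrency, several callers could pass the half-open check at once; a production breaker needs a lock or a shared store to admit only one probe.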

Timeout discipline

Always set explicit timeouts:

openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    request_timeout=30,  # absolute max wait; client-side HTTP timeout in openai 0.x
)

Without timeout, a hung request blocks the worker indefinitely. Service degrades silently.

For streaming, set a longer timeout (60-120s) since responses can take that long; for batch jobs, shorter (10-20s).

Idempotency on retries

A common worry is that a retried request gets billed twice. OpenAI’s API is idempotent at the cost level: it charges per API call that actually completes, so a call that failed with an error adds nothing on top of the successful retry.

For your own app’s idempotency (not creating duplicate records when a retry lands), use an idempotency key:

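def llm_then_save(prompt: str, key: str):
    # get_cached_result and save_result are app-defined; r is a Redis client.
    # Already done?
    if r.exists(f"done:{key}"):
        return get_cached_result(key)
    result = complete_resilient(prompt)
    save_result(key, result)
    # Mark as done for 24 hours so a late retry returns the saved result.
    r.setex(f"done:{key}", 86400, "1")
    return result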

What to alert on

Three alerts I recommend:

  1. Sustained error rate > 10% over 5 min. Something’s wrong.
  2. Circuit breaker open for > 2 min. Fallback active; degraded.
  3. Auth errors at all. Misconfiguration.

All three should reach the on-call channel; only auth-error is page-worthy.
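
The first alert needs an error rate over a sliding window. Most metrics stacks compute this for you, but the logic is simple enough to sketch in-process. `ErrorRateWindow` is a hypothetical name, and the `min_calls` floor is an added assumption to avoid alerting on a handful of requests.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class ErrorRateWindow:
    def __init__(self, window_seconds: float = 300, threshold: float = 0.10,
                 min_calls: int = 20, clock: Callable[[], float] = time.monotonic):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_calls = min_calls  # assumption: ignore very low traffic
        self.clock = clock
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, failed)

    def record(self, failed: bool) -> None:
        self._events.append((self.clock(), failed))

    def error_rate(self) -> float:
        # Drop events that have aged out of the window.
        cutoff = self.clock() - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        if not self._events:
            return 0.0
        return sum(failed for _, failed in self._events) / len(self._events)

    def should_alert(self) -> bool:
        rate = self.error_rate()  # also prunes old events
        return len(self._events) >= self.min_calls and rate > self.threshold
```

Call `record(failed=...)` after every LLM request and poll `should_alert()` from whatever loop feeds the on-call channel.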

Common Pitfalls

Retrying on InvalidRequestError. Won’t fix; wastes time + money.

No timeout. Hung request blocks forever.

Fallback that quietly returns wrong data. Mark low-confidence; surface upstream.

Infinite retry. Always stop_after_attempt or max_attempts cap.

Circuit breaker per-instance, not shared. With 10 replicas, each pod burns through its own 5 attempts before its breaker trips, so OpenAI still sees up to 50 calls. Use a Redis-backed breaker if possible.

Logging full prompt + response on every error. Verbose; PII risk. Log identifiers + token counts.

Wrapping Up

Transient retry + fallback + circuit breaker + timeout = LLM integration that survives OpenAI’s bad days. Closes January. February theme: TypeScript dominance + Next.js architecture.