Error Handling and Retries for LLM APIs


January 27, 2023 · 4 min read · by Muhammad Amal ai

TL;DR — OpenAI has outages. Transient errors (429, 500, 503, timeout) get exponential backoff retries. Permanent errors (400, 401, 404) get no retry. Have a fallback: cached response, rule-based default, or “service degraded” message. Never let LLM downtime take your service down.

After cost control, the reliability layer. In early 2023, OpenAI’s status page shows minor incidents most weeks and occasional major outages. Your service shouldn’t share its uptime.

OpenAI error types

The openai Python library raises specific exceptions:

Exception	HTTP	Retry?	Cause
`RateLimitError`	429	yes (backoff)	exceeded quota
`Timeout`	n/a	yes	request took too long
`APIConnectionError`	n/a	yes	network issue
`APIError`	500/502/503	yes	OpenAI server error
`ServiceUnavailableError`	503	yes	OpenAI overloaded
`InvalidRequestError`	400	no	bad prompt / params
`AuthenticationError`	401	no	bad API key
`PermissionError`	403	no	access denied

Retry on the top five; fail-fast on the bottom three.
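The same split can be expressed at the HTTP level, which helps when you only have a raw status code rather than a library exception. This is a minimal sketch: `is_retryable` and the two sets are hypothetical names, not part of the openai library, and `None` stands for a timeout or connection failure where no response arrived.

```python
from typing import Optional

# Statuses the table marks as transient (set names are illustrative).
RETRYABLE_STATUSES = frozenset({429, 500, 502, 503})
# Statuses the table marks as permanent: retrying cannot change the outcome.
PERMANENT_STATUSES = frozenset({400, 401, 403})


def is_retryable(status: Optional[int]) -> bool:
    """Return True when a failed call is worth retrying.

    `status` is None when no HTTP response arrived at all
    (timeout or connection error), which the table treats as transient.
    Statuses not listed in either set fail fast.
    """
    if status is None:
        return True
    if status in PERMANENT_STATUSES:
        return False
    return status in RETRYABLE_STATUSES
```

Treating unknown statuses as non-retryable is a design choice in this sketch: it is the cautious default, and you can widen `RETRYABLE_STATUSES` once you have seen a status behave transiently.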

Retry pattern in Python

import openai
from tenacity import (
    retry, retry_if_exception_type, stop_after_attempt,
    wait_exponential, before_sleep_log
)
import logging

log = logging.getLogger(__name__)

RETRYABLE = (
    openai.error.RateLimitError,
    openai.error.Timeout,
    openai.error.APIConnectionError,
    openai.error.APIError,
    openai.error.ServiceUnavailableError,
)

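@retry(
    retry=retry_if_exception_type(RETRYABLE),
    wait=wait_exponential(multiplier=2, min=4, max=60),
    stop=stop_after_attempt(5),
    before_sleep=before_sleep_log(log, logging.WARNING),
    # Re-raise the original openai exception once attempts run out;
    # without this, tenacity raises its own RetryError instead.
    reraise=True,
)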
def complete_resilient(prompt: str, **kwargs) -> str:
    return openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        request_timeout=30,  # in openai 0.x, request_timeout bounds the HTTP call; timeout= does not
        **kwargs,
    ).choices[0].text.strip()

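Five attempts means at most four waits between them (nothing is slept after the last failure), growing exponentially from the 4s floor toward the 60s cap. With each attempt allowed up to 30s, the worst case is roughly 3 minutes.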
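To sanity-check the worst case, the schedule can be computed by hand. The sketch below assumes a clamped exponential formula (`multiplier * 2**i`, held between the floor and the cap); the exact schedule tenacity produces can differ by version, so treat the numbers as an approximation. `backoff_waits` and `worst_case_seconds` are hypothetical helper names.

```python
from typing import List


def backoff_waits(attempts: int, multiplier: float = 2,
                  lo: float = 4, hi: float = 60) -> List[float]:
    """Waits between attempts: multiplier * 2**i, clamped to [lo, hi].

    There is one wait fewer than there are attempts, because nothing
    is slept after the final failure.
    """
    return [min(max(multiplier * 2 ** i, lo), hi) for i in range(attempts - 1)]


def worst_case_seconds(attempts: int, per_attempt_timeout: float) -> float:
    """Every attempt runs to its timeout, plus all the waits in between."""
    return attempts * per_attempt_timeout + sum(backoff_waits(attempts))


waits = backoff_waits(5)
total = worst_case_seconds(5, per_attempt_timeout=30)
print(waits, total)
```

Under these assumptions the timeouts dominate the total, not the backoff. If the worst case is too long for your callers, lowering the per-attempt timeout helps more than trimming the waits.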

Special handling: rate limits

A 429 from OpenAI may carry a Retry-After header. When it does, honor it:

import time

class RetryAfterRetryer:
    def __init__(self, max_attempts=5):
        self.max_attempts = max_attempts

    def call(self, fn, *args, **kwargs):
        for attempt in range(self.max_attempts):
            try:
                return fn(*args, **kwargs)
            except openai.error.RateLimitError as e:
                # Out of attempts: re-raise now rather than sleeping first.
                if attempt == self.max_attempts - 1:
                    raise
                # Prefer the server's hint; otherwise back off exponentially.
                retry_after = (getattr(e, 'headers', None) or {}).get('retry-after')
                wait = float(retry_after) if retry_after else (2 ** attempt)
                log.warning(f"rate limited; waiting {wait}s")
                time.sleep(wait)

Hammering after a 429 makes OpenAI throttle harder. Wait as long as they tell you to.
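The HTTP specification allows Retry-After to be either a number of seconds or an HTTP date, so a tolerant parser is worth having. A minimal sketch using only the standard library; `parse_retry_after` is a hypothetical helper, and which of the two forms OpenAI actually sends is not something this sketch assumes.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str], default: float,
                      now: Optional[datetime] = None) -> float:
    """Return the number of seconds to wait for a Retry-After value.

    Falls back to `default` when the header is missing or unparseable.
    """
    if not value:
        return default
    # Form 1: delay in seconds, e.g. "20".
    try:
        return max(0.0, float(value))
    except ValueError:
        pass
    # Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

Passing the exponential backoff value as `default` keeps one code path for both cases: the server's hint when present, your own schedule otherwise.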

Distinguishing transient from permanent

def safe_complete(prompt: str) -> str:
    try:
        return complete_resilient(prompt)
    except openai.error.InvalidRequestError as e:
        log.error(f"prompt invalid: {e}")
        # Don't retry. Fix the prompt.
        raise BadPromptError(str(e))
    except openai.error.AuthenticationError:
        log.critical("openai auth failed")
        # Critical: trigger pager
        alert_oncall("openai_auth_failure")
        raise
    except RETRYABLE as e:
        log.error(f"openai transient after retries: {e}")
        # Already retried in complete_resilient; fall back
        return fallback_response(prompt)

Three error categories, three responses:

Permanent code bugs → fail fast, log, alert
Auth failures → page someone; service is mis-configured
Transient → fall back gracefully

Fallback patterns

When the LLM is down, you have three options:

Option A — cached response. If the same prompt was answered before, return that answer:

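import hashlib

def fallback_from_cache(prompt: str) -> str:
    # Built-in hash() is salted per process, so its keys don't survive
    # a restart or match across workers. Use a stable digest instead.
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    cached = r.get(f"llm:{key}")  # r is a Redis client
    if cached:
        return cached.decode()
    raise ServiceUnavailable()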
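The cache fallback only works if successful responses were stored under the same key in the first place. Here is a sketch of both sides, with a plain dict standing in for the real cache; `cache_key`, `remember`, and `lookup` are hypothetical names.

```python
import hashlib
from typing import Dict, Optional

# In-memory stand-in for the cache; production would use Redis or similar.
_cache: Dict[str, str] = {}


def cache_key(prompt: str) -> str:
    """Stable key: the same prompt maps to the same key in every process."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return f"llm:{digest}"


def remember(prompt: str, response: str) -> None:
    """Call after every successful completion so the fallback has data."""
    _cache[cache_key(prompt)] = response


def lookup(prompt: str) -> Optional[str]:
    """Return the last known good response, or None on a cache miss."""
    return _cache.get(cache_key(prompt))
```

Deriving the key in one function keeps the write path and the fallback path from drifting apart.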

Option B — rule-based default. For classification:

def fallback_classify(text: str) -> dict:
    # Crude rules
    text_lower = text.lower()
    if any(w in text_lower for w in ["bill", "charge", "refund", "payment"]):
        return {"category": "billing", "confidence": 0.5}
    if any(w in text_lower for w in ["error", "bug", "broken", "crash"]):
        return {"category": "technical", "confidence": 0.5}
    return {"category": "other", "confidence": 0.3}

The lower confidence is the flag: downstream code can see that the result came from rules and should not be trusted blindly.

Option C — degraded service message. For chat features:

"Our AI assistant is temporarily unavailable. Your message has been queued and we'll respond shortly."

Then write the message to a queue and process it when the API is back.
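A sketch of the queue-and-drain flow, assuming an in-memory deque as a stand-in for a durable queue. `handle_while_down` and `drain` are hypothetical names, and `answer` is whatever function calls the LLM once it is reachable again.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

DEGRADED_MESSAGE = (
    "Our AI assistant is temporarily unavailable. "
    "Your message has been queued and we'll respond shortly."
)

# In-memory stand-in; a real deployment would want a durable queue.
pending: Deque[Tuple[str, str]] = deque()


def handle_while_down(user_id: str, message: str) -> str:
    """Park the message and tell the user what happened."""
    pending.append((user_id, message))
    return DEGRADED_MESSAGE


def drain(answer: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Process parked messages in arrival order once the API is back.

    Stops at the first failure and leaves that message at the front
    of the queue, so nothing is lost.
    """
    delivered = []
    while pending:
        user_id, message = pending[0]
        try:
            reply = answer(message)
        except Exception:
            break
        pending.popleft()
        delivered.append((user_id, reply))
    return delivered
```

Peeking at the head before popping means a failure mid-drain leaves the queue intact, at the cost of possibly answering one message twice if the process dies between the call and the pop.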

Circuit breaker

For cascading failure prevention:

from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=60, expected_exception=Exception)
def llm_call(prompt: str) -> str:
    return complete_resilient(prompt)

After 5 consecutive failures, circuit opens — subsequent calls fail immediately for 60 seconds. Prevents thundering herd when OpenAI is down.

After 60s, circuit half-opens — one test call; if it succeeds, full operation resumes.
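To make the closed, open, and half-open states concrete, here is a hand-rolled breaker. It is an illustrative sketch, not the circuitbreaker library's implementation: `SimpleBreaker` and `CircuitOpenError` are hypothetical names, the clock is injectable so the behavior can be tested without sleeping, and the class is not thread-safe.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class SimpleBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 60,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0                       # consecutive failures so far
        self.opened_at: Optional[float] = None  # None means closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Recovery window elapsed: half-open, let this one call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed test call, or reaching the threshold, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get `CircuitOpenError` immediately and can go straight to a fallback instead of waiting out a full retry cycle.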

Timeout discipline

Always set explicit timeouts:

openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    request_timeout=30,  # absolute max wait; in openai 0.x this, not timeout=, bounds the HTTP call
)

Without a timeout, a hung request blocks the worker indefinitely, and the service degrades silently.

For streaming, set a longer timeout (60-120s) since responses can take that long; for batch jobs, shorter (10-20s).
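A per-request timeout bounds one attempt, but with retries the caller can still wait several times that long. One way to bound the total is to give the whole operation a deadline and derive each attempt's timeout from what is left. A minimal sketch; `remaining_timeout` is a hypothetical helper, and the clock is injectable for testing.

```python
import time
from typing import Callable


def remaining_timeout(deadline: float, per_attempt_cap: float,
                      clock: Callable[[], float] = time.monotonic) -> float:
    """Timeout for the next attempt: the cap, or whatever budget is left.

    Raises TimeoutError when the overall deadline has already passed,
    so the caller stops retrying instead of starting a doomed attempt.
    """
    left = deadline - clock()
    if left <= 0:
        raise TimeoutError("overall deadline exceeded")
    return min(per_attempt_cap, left)
```

In use, you would set `deadline = time.monotonic() + 90` once per incoming request and pass `remaining_timeout(deadline, 30)` as the timeout for each attempt.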

Idempotency on retries

A natural worry is that a retried request gets charged twice. OpenAI bills per API call that actually completes, so a retry that succeeds doesn’t add a charge for the call that failed.

Your app’s own idempotency is a separate problem: a retry must not create duplicate records. Use an idempotency key:

def llm_then_save(prompt: str, key: str):
    # Already done?
    if r.exists(f"done:{key}"):
        return get_cached_result(key)
    result = complete(prompt)
    save_result(key, result)
    r.setex(f"done:{key}", 86400, "1")
    return result
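The check-then-save flow leaves a small window in which two concurrent requests with the same key can both pass the check. A sketch of closing that window inside one process with a per-key lock; `run_once` and the dict-based store are hypothetical stand-ins. Across processes the store itself has to provide the atomic step, for example redis-py's `set(..., nx=True, ex=...)`.

```python
import threading
from typing import Callable, Dict

_results: Dict[str, str] = {}            # stand-in for the results store
_locks: Dict[str, threading.Lock] = {}   # one lock per idempotency key
_guard = threading.Lock()                # protects the _locks dict itself


def _lock_for(key: str) -> threading.Lock:
    with _guard:
        return _locks.setdefault(key, threading.Lock())


def run_once(key: str, prompt: str, complete: Callable[[str], str]) -> str:
    """Run `complete(prompt)` at most once per key within this process.

    A second caller with the same key waits for the first to finish,
    then gets the stored result instead of triggering another call.
    """
    with _lock_for(key):
        if key in _results:
            return _results[key]
        result = complete(prompt)
        _results[key] = result
        return result
```

If `complete` raises, nothing is stored, so the next caller with the same key tries again rather than receiving a cached failure.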

What to alert on

Three alerts I recommend:

Sustained error rate > 10% over 5 min. Something’s wrong.
Circuit breaker open for > 2 min. Fallback active; degraded.
Auth errors at all. Misconfiguration.

All three should reach the on-call channel; only auth-error is page-worthy.
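The first alert needs an error rate measured over a sliding window. Most metrics systems provide this out of the box; if yours doesn't, the logic looks roughly like the sketch below. `ErrorRateWindow` is a hypothetical name, and timestamps are passed in explicitly so the behavior is easy to test.

```python
from collections import deque
from typing import Deque, Tuple


class ErrorRateWindow:
    """Tracks call outcomes over a sliding time window."""

    def __init__(self, window_seconds: float = 300, threshold: float = 0.10):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.events: Deque[Tuple[float, bool]] = deque()  # (timestamp, failed)

    def record(self, timestamp: float, failed: bool) -> None:
        self.events.append((timestamp, failed))

    def error_rate(self, now: float) -> float:
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            self.events.popleft()
        if not self.events:
            return 0.0
        failures = sum(1 for _, failed in self.events if failed)
        return failures / len(self.events)

    def should_alert(self, now: float) -> bool:
        return self.error_rate(now) > self.threshold
```

With very low traffic a single failure can push the rate past 10%, so a real alert rule would likely also require a minimum number of calls in the window.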

Common Pitfalls

Retrying on InvalidRequestError. Won’t fix; wastes time + money.

No timeout. Hung request blocks forever.

Fallback that quietly returns wrong data. Mark low-confidence; surface upstream.

Infinite retry. Always stop_after_attempt or max_attempts cap.

Circuit breaker per-instance, not shared. Each replica counts failures on its own, so with 10 pods and a threshold of 5, about 50 calls fail before every breaker is open. Use a Redis-backed breaker if possible.

Logging full prompt + response on every error. Verbose; PII risk. Log identifiers + token counts.
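For the last pitfall, a sketch of what "identifiers + counts" can look like in practice. `safe_log_fields` is a hypothetical helper, and character counts stand in for token counts here, since real token counts require the API's usage data or a tokenizer.

```python
import hashlib
from typing import Dict, Union


def safe_log_fields(request_id: str, prompt: str,
                    response: str = "") -> Dict[str, Union[str, int]]:
    """Build log fields that identify a call without including its text."""
    return {
        "request_id": request_id,
        # Short digest lets you correlate identical prompts without storing them.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12],
        # Character counts are a rough proxy; swap in token counts if available.
        "prompt_chars": len(prompt),
        "response_chars": len(response),
    }
```

The fields can be passed to a structured logger or formatted into the message, for example `log.error("openai call failed: %s", safe_log_fields(req_id, prompt))`.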

Wrapping Up

Transient retry + fallback + circuit breaker + timeout = LLM integration that survives OpenAI’s bad days. Closes January. February theme: TypeScript dominance + Next.js architecture.
