Error Handling and Retries for LLM APIs
TL;DR — OpenAI has outages. Transient errors (429, 500, 503, timeout) get exponential backoff retries. Permanent errors (400, 401, 404) get no retry. Have a fallback: cached response, rule-based default, or “service degraded” message. Never let LLM downtime take your service down.
After cost control, the reliability layer. OpenAI’s status page in early 2023 has weekly minor incidents and occasional major outages. Your service shouldn’t share its uptime.
OpenAI error types
The openai Python library (the 0.x releases, which expose openai.error) raises specific exceptions:
| Exception | HTTP | Retry? | Cause |
|---|---|---|---|
| RateLimitError | 429 | yes (backoff) | exceeded quota |
| Timeout | n/a | yes | request took too long |
| APIConnectionError | n/a | yes | network issue |
| APIError | 500/502/503 | yes | OpenAI server error |
| ServiceUnavailableError | 503 | yes | OpenAI overloaded |
| InvalidRequestError | 400 | no | bad prompt / params |
| AuthenticationError | 401 | no | bad API key |
| PermissionError | 403 | no | access denied |
Retry on the top five; fail-fast on the bottom three.
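If all you have is an HTTP status code (say, from a proxy log or a different client), the same split can be written as a small lookup. This is a minimal sketch: `should_retry` and the two status sets are illustrative names built from the table and the TL;DR, not part of the openai library.

```python
from typing import Optional

# Status codes taken from the table and TL;DR above; adjust for your provider.
RETRYABLE_STATUS = {429, 500, 502, 503}
PERMANENT_STATUS = {400, 401, 403, 404}


def should_retry(status: Optional[int]) -> bool:
    """Return True when a failure looks transient.

    status=None stands for "no HTTP response at all" (timeout or
    connection error), which the table treats as retryable.
    """
    if status is None:
        return True
    if status in PERMANENT_STATUS:
        return False
    # Anything not explicitly listed as retryable fails fast.
    return status in RETRYABLE_STATUS
```

Unknown codes fall through to "don't retry" on purpose: a surprise status is more likely a bug on your side than a blip on theirs.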
Retry pattern in Python
    import logging

    import openai
    from tenacity import (
        before_sleep_log, retry, retry_if_exception_type,
        stop_after_attempt, wait_exponential,
    )

    log = logging.getLogger(__name__)

    # Transient failures worth retrying (openai-python 0.x exception classes).
    RETRYABLE = (
        openai.error.RateLimitError,
        openai.error.Timeout,
        openai.error.APIConnectionError,
        openai.error.APIError,
        openai.error.ServiceUnavailableError,
    )

    @retry(
        retry=retry_if_exception_type(RETRYABLE),
        wait=wait_exponential(multiplier=2, min=4, max=60),
        stop=stop_after_attempt(5),
        before_sleep=before_sleep_log(log, logging.WARNING),
        reraise=True,  # surface the original exception, not tenacity.RetryError
    )
    def complete_resilient(prompt: str, **kwargs) -> str:
        return openai.Completion.create(
            model="text-davinci-003",
            prompt=prompt,
            # In openai-python 0.x the HTTP timeout is request_timeout;
            # plain `timeout` only bounds the wait for model warm-up.
            request_timeout=30,
            **kwargs,
        ).choices[0].text.strip()
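Five attempts means four waits between them. With these settings, recent tenacity releases wait 4s, 4s, 8s, 16s (multiplier * 2^(attempt - 1), clamped to the 4s floor and 60s cap), about 32s of backoff. Add up to five 30-second request timeouts and the worst case is roughly 3 minutes.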
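It's worth checking a backoff config on paper before shipping it. The sketch below assumes the formula used by recent tenacity releases, multiplier * 2^(attempt - 1) clamped to [min, max]; `backoff_schedule` and `worst_case_seconds` are hypothetical helpers, not tenacity APIs.

```python
from typing import List


def backoff_schedule(attempts: int, multiplier: float = 2, floor: float = 4,
                     cap: float = 60) -> List[float]:
    """Waits between attempts: n attempts leave room for n - 1 sleeps."""
    waits = []
    for attempt in range(1, attempts):
        raw = multiplier * 2 ** (attempt - 1)
        # Clamp to the configured floor and cap.
        waits.append(max(floor, min(raw, cap)))
    return waits


def worst_case_seconds(attempts: int, request_timeout: float, **backoff) -> float:
    """Every attempt runs to its timeout, plus all the sleeps in between."""
    return attempts * request_timeout + sum(backoff_schedule(attempts, **backoff))
```

With the settings from the decorator, `backoff_schedule(5)` gives `[4, 4, 8, 16]` and `worst_case_seconds(5, 30)` gives 182. That number is how long a caller can be stuck, so compare it against your own request deadline.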
Special handling: rate limits
OpenAI’s 429 responses can include a Retry-After header. Honor it when present:
    import time

    class RetryAfterRetryer:
        def __init__(self, max_attempts=5):
            self.max_attempts = max_attempts

        def call(self, fn, *args, **kwargs):
            for attempt in range(self.max_attempts):
                try:
                    return fn(*args, **kwargs)
                except openai.error.RateLimitError as e:
                    # Out of attempts: re-raise instead of sleeping again.
                    if attempt == self.max_attempts - 1:
                        raise
                    # Prefer the server's hint; fall back to exponential backoff.
                    retry_after = (getattr(e, "headers", None) or {}).get("retry-after")
                    try:
                        wait = float(retry_after)
                    except (TypeError, ValueError):
                        wait = 2 ** attempt
                    log.warning("rate limited; waiting %ss", wait)
                    time.sleep(wait)
Hammering after a 429 makes OpenAI throttle harder. Wait what they say.
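Per the HTTP spec, Retry-After can be either a number of seconds or an HTTP date, so a bare `int()` can blow up on a valid header. Here is a defensive parser as a sketch; `parse_retry_after` is an illustrative name, and I haven't verified which of the two forms OpenAI actually sends.

```python
import email.utils
import time
from typing import Optional


def parse_retry_after(value: Optional[str], now: Optional[float] = None) -> Optional[float]:
    """Return seconds to wait, or None if the header is missing or unparseable."""
    if not value:
        return None
    value = value.strip()
    # Form 1: delay in seconds, e.g. "Retry-After: 20".
    try:
        return max(0.0, float(value))
    except ValueError:
        pass
    # Form 2: HTTP date, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = email.utils.parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    now = time.time() if now is None else now
    return max(0.0, when.timestamp() - now)
```

A `None` result means "no usable hint", which is the cue to fall back to your own exponential backoff.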
Distinguishing transient from permanent
    # BadPromptError, alert_oncall and fallback_response are application-defined.
    def safe_complete(prompt: str) -> str:
        try:
            return complete_resilient(prompt)
        except openai.error.InvalidRequestError as e:
            log.error(f"prompt invalid: {e}")
            # Don't retry. Fix the prompt.
            raise BadPromptError(str(e)) from e
        except openai.error.AuthenticationError:
            log.critical("openai auth failed")
            # Critical: trigger pager
            alert_oncall("openai_auth_failure")
            raise
        except RETRYABLE as e:
            # Reached only if the @retry decorator sets reraise=True;
            # by default tenacity raises RetryError instead.
            log.error(f"openai transient after retries: {e}")
            # Already retried in complete_resilient; fall back
            return fallback_response(prompt)
Three error categories, three responses:
- Permanent code bugs → fail fast, log, alert
- Auth failures → page someone; service is mis-configured
- Transient → fall back gracefully
Fallback patterns
When LLM is down, three options:
Option A — cached response. If same prompt was answered before, return that:
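    import hashlib

    def fallback_from_cache(prompt: str) -> str:
        # hash() is salted per process for strings, so use a stable digest as the key.
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        cached = r.get(f"llm:{key}")  # r: Redis client, configured elsewhere
        if cached:
            return cached.decode()
        raise ServiceUnavailable()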
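The cache fallback only helps if successful responses were stored on the way out. Below is a sketch of the write path, assuming a Redis-like client with `get` and `setex`. The dict-backed `FakeCache` stands in for it so the example runs on its own; `cache_key`, `remember` and `recall` are illustrative names.

```python
import hashlib
from typing import Optional


def cache_key(prompt: str) -> str:
    # Stable across processes and restarts, unlike the built-in hash().
    return "llm:" + hashlib.sha256(prompt.encode("utf-8")).hexdigest()


class FakeCache:
    """In-memory stand-in for a Redis client (ignores TTL expiry)."""

    def __init__(self) -> None:
        self._data = {}

    def setex(self, key: str, ttl_seconds: int, value: str) -> None:
        self._data[key] = value.encode("utf-8")

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)


def remember(cache, prompt: str, response: str, ttl_seconds: int = 86400) -> None:
    """Call after every successful completion."""
    cache.setex(cache_key(prompt), ttl_seconds, response)


def recall(cache, prompt: str) -> Optional[str]:
    """Return the cached response for this exact prompt, if any."""
    cached = cache.get(cache_key(prompt))
    return cached.decode("utf-8") if cached is not None else None
```

Exact-match caching only pays off for repeated prompts (templates, classification of common inputs). Free-form chat will rarely hit.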
Option B — rule-based default. For classification:
    def fallback_classify(text: str) -> dict:
        # Crude rules
        text_lower = text.lower()
        if any(w in text_lower for w in ["bill", "charge", "refund", "payment"]):
            return {"category": "billing", "confidence": 0.5}
        if any(w in text_lower for w in ["error", "bug", "broken", "crash"]):
            return {"category": "technical", "confidence": 0.5}
        return {"category": "other", "confidence": 0.3}
The low confidence value flags these results as guesses, so downstream code knows not to trust them blindly.
Option C — degraded service message. For chat features:
"Our AI assistant is temporarily unavailable. Your message has been queued and we'll respond shortly."
Then write to a queue; process when API is back.
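The queue-and-drain flow can be sketched with an in-process deque. Everything here (`DeferredQueue`, `defer`, `drain`) is a hypothetical shape, and a real deployment would want a durable queue, since this one loses its contents on restart.

```python
from collections import deque
from typing import Callable, Deque, Tuple

DEGRADED_MESSAGE = (
    "Our AI assistant is temporarily unavailable. "
    "Your message has been queued and we'll respond shortly."
)


class DeferredQueue:
    """Holds (user_id, message) pairs until the LLM API is reachable again."""

    def __init__(self) -> None:
        self._pending: Deque[Tuple[str, str]] = deque()

    def defer(self, user_id: str, message: str) -> str:
        """Queue the message and return the text to show the user now."""
        self._pending.append((user_id, message))
        return DEGRADED_MESSAGE

    def drain(self, handler: Callable[[str, str], None]) -> int:
        """Process queued items in order; stop at the first handler failure."""
        processed = 0
        while self._pending:
            user_id, message = self._pending[0]
            handler(user_id, message)  # if this raises, the item stays queued
            self._pending.popleft()
            processed += 1
        return processed

    def __len__(self) -> int:
        return len(self._pending)
```

Peeking before popping means a handler that fails mid-drain leaves the message in place for the next attempt.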
Circuit breaker
For cascading failure prevention:
    from circuitbreaker import circuit

    @circuit(failure_threshold=5, recovery_timeout=60, expected_exception=Exception)
    def llm_call(prompt: str) -> str:
        return complete_resilient(prompt)
After 5 consecutive failures, circuit opens — subsequent calls fail immediately for 60 seconds. Prevents thundering herd when OpenAI is down.
After 60s, circuit half-opens — one test call; if it succeeds, full operation resumes.
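To make the closed / open / half-open transitions concrete, here is a stripped-down breaker. It is a single-threaded sketch for illustration, not how the circuitbreaker package is implemented; `SimpleBreaker` and `CircuitOpenError` are made-up names, and the injectable clock exists only so the behavior can be tested without sleeping.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class SimpleBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0                        # consecutive failures so far
        self._opened_at: Optional[float] = None   # None means closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "half-open"
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call (half-open) or reaching the threshold
            # opens the circuit and restarts the recovery timer.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Under concurrency a real breaker also has to make sure only one trial call goes through while half-open. This sketch skips that.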
Timeout discipline
Always set explicit timeouts:
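    openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        # HTTP-level cap in openai-python 0.x; plain `timeout` only
        # bounds the wait for model warm-up.
        request_timeout=30,  # absolute max wait
    )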
Without an explicit timeout, a hung request holds its worker until the client library's default expires (600 seconds in openai-python 0.x). The service degrades silently.
For streaming, set a longer timeout (60-120s) since responses can take that long; for batch jobs, shorter (10-20s).
Idempotency on retries
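Will a retry double-charge you? Mostly no: OpenAI bills per call that is actually served, so a request that fails outright costs nothing and a retry that succeeds is billed once. The edge case is a client-side timeout, where the original request may still complete on OpenAI’s side and you can end up paying for both it and the retry.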
For your app’s own idempotency (don’t double-create records when a retry lands), use an idempotency key:
    def llm_then_save(prompt: str, key: str):
        # Already done?
        if r.exists(f"done:{key}"):
            return get_cached_result(key)
        result = complete(prompt)
        save_result(key, result)
        r.setex(f"done:{key}", 86400, "1")
        return result
What to alert on
Three alerts I recommend:
- Sustained error rate > 10% over 5 min. Something’s wrong.
- Circuit breaker open for > 2 min. Fallback active; degraded.
- Auth errors at all. Misconfiguration.
All three should reach the on-call channel; only auth-error is page-worthy.
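If you don't have a metrics stack computing the first alert for you, a sliding window is enough. This is a sketch under assumptions: `ErrorRateWindow` is an illustrative name, and the `min_calls` floor is my addition so that one failure out of two calls doesn't fire the alert.

```python
from collections import deque
from typing import Deque, Tuple


class ErrorRateWindow:
    """Tracks call outcomes over a sliding time window."""

    def __init__(self, window_seconds: float = 300.0, threshold: float = 0.10,
                 min_calls: int = 20) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_calls = min_calls  # assumption: ignore windows with too few samples
        self._events: Deque[Tuple[float, bool]] = deque()

    def record(self, timestamp: float, ok: bool) -> None:
        self._events.append((timestamp, ok))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop events that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def error_rate(self, now: float) -> float:
        self._evict(now)
        if not self._events:
            return 0.0
        errors = sum(1 for _, ok in self._events if not ok)
        return errors / len(self._events)

    def should_alert(self, now: float) -> bool:
        rate = self.error_rate(now)
        return len(self._events) >= self.min_calls and rate > self.threshold
```

Call `record(time.time(), ok)` after every LLM call and check `should_alert(time.time())` on a timer. Timestamps are expected in non-decreasing order.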
Common Pitfalls
Retrying on InvalidRequestError. Won’t fix; wastes time + money.
No timeout. Hung request blocks forever.
Fallback that quietly returns wrong data. Mark low-confidence; surface upstream.
Infinite retry. Always stop_after_attempt or max_attempts cap.
Circuit breaker per-instance, not shared. Across replicas, all 10 pods each retry their own 5x; effectively 50 retries. Use Redis-backed breaker if possible.
Logging full prompt + response on every error. Verbose; PII risk. Log identifiers + token counts.
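For that last pitfall, here is one possible shape for an error log record that stays useful without carrying the prompt text. It is a sketch: `error_log_fields` and its field names are illustrative, and the token count is passed in by the caller rather than computed here.

```python
import hashlib
from typing import Optional


def error_log_fields(request_id: str, prompt: str, model: str, error: Exception,
                     prompt_tokens: Optional[int] = None) -> dict:
    """Build a log payload with identifiers and sizes, never the prompt itself."""
    return {
        "request_id": request_id,
        "model": model,
        # A short digest lets you correlate repeated prompts without storing them.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16],
        "prompt_chars": len(prompt),
        "prompt_tokens": prompt_tokens,  # None when the caller has no count
        # Type name only: error messages can echo parts of the request.
        "error_type": type(error).__name__,
    }
```

Pass the result as structured context, for example `log.error("llm call failed", extra={"llm": fields})`, so it lands in whatever log pipeline you already have.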
Wrapping Up
Transient retry + fallback + circuit breaker + timeout = LLM integration that survives OpenAI’s bad days. Closes January. February theme: TypeScript dominance + Next.js architecture.