LLM Vendor Risk, A Failover Playbook After the OpenAI Weekend


November 30, 2023 · 7 min read · by Muhammad Amal

TL;DR — November showed us LLM vendor risk is real, even for the most established provider. Plan for it. / A thin abstraction layer over chat completions plus a tested failover path is the 80/20. Don’t over-engineer. / The migration cost from single-provider to multi-provider is small if you do it before you need it and brutal if you do it during an incident.

November started with OpenAI DevDay shipping GPT-4 Turbo, the Assistants API, and a price cut of up to 3x (input tokens; output tokens dropped 2x). November ended with OpenAI’s board firing and rehiring Sam Altman over five days while engineering teams worldwide rewrote their failover plans in real time.

If you went into that weekend with a single-provider production LLM stack and no failover path, you didn’t sleep well. If you went in with multi-provider routing already wired up, you slept fine. The difference between the two states is a few weeks of engineering work that almost nobody bothered to do because everything was working.

This post is the playbook I’m rolling out across projects this week. It’s deliberately minimal. The goal is “survive a 48-hour provider outage with degraded but functional service,” not “build a perfect multi-vendor abstraction.” The latter is a year of work. The former is a sprint.

The Threat Model

What are we actually planning for? Three concrete scenarios.

Provider API outage. The chat completions endpoint returns 5xx for hours. Status page acknowledges it. You wait, or you fail over.

Provider degradation. The endpoint responds but latency spikes to 30s p99, or quality regresses (a model variant goes weird), or rate limits drop without notice.

Provider business risk. Pricing changes overnight. A model is deprecated with short notice. Account suspended due to an automated review false positive. The company goes through leadership chaos and customer trust erodes.

The first two are technical and you fail over for them. The third is strategic and you mitigate by having an existing relationship with a second provider, not by failing over reactively.

The Minimum Viable Abstraction

You do not need a heavyweight provider-agnostic framework. You need a thin client layer that exposes the operations your app actually uses, with provider-specific implementations behind it.

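# Pinned at time of writing: openai==1.3.5, anthropic==0.7.7
# NOTE: AnthropicClient below calls client.messages.create and reads resp.usage.
# Verify that the anthropic SDK version you pin exposes both before relying on it.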
from typing import Protocol, AsyncIterator
from dataclasses import dataclass

@dataclass
class LLMResponse:
    text: str
    prompt_tokens: int
    completion_tokens: int
    model: str
    provider: str

class LLMClient(Protocol):
    async def complete(
        self,
        system: str,
        messages: list[dict],
        max_tokens: int = 1500,
        temperature: float = 0.0,
    ) -> LLMResponse: ...

class OpenAIClient:
    def __init__(self, model: str = "gpt-4-1106-preview"):
        from openai import AsyncOpenAI
        self.client = AsyncOpenAI()
        self.model = model

    async def complete(self, system, messages, max_tokens=1500, temperature=0.0):
        msgs = [{"role": "system", "content": system}] + messages
        resp = await self.client.chat.completions.create(
            model=self.model,
            messages=msgs,
            max_tokens=max_tokens,
            temperature=temperature,
        )
        return LLMResponse(
            text=resp.choices[0].message.content or "",  # content can be None
            prompt_tokens=resp.usage.prompt_tokens,
            completion_tokens=resp.usage.completion_tokens,
            model=self.model,
            provider="openai",
        )

class AnthropicClient:
    def __init__(self, model: str = "claude-2.1"):
        from anthropic import AsyncAnthropic
        self.client = AsyncAnthropic()
        self.model = model

    async def complete(self, system, messages, max_tokens=1500, temperature=0.0):
        resp = await self.client.messages.create(
            model=self.model,
            system=system,
            messages=messages,
            max_tokens=max_tokens,
            temperature=temperature,
        )
        return LLMResponse(
            text=resp.content[0].text,
            prompt_tokens=resp.usage.input_tokens,
            completion_tokens=resp.usage.output_tokens,
            model=self.model,
            provider="anthropic",
        )

That’s the entire abstraction for chat. Function calling, streaming, vision — those need provider-specific work and they may not be portable. Cover chat first because chat is what fails most painfully in an outage.
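Because LLMClient is a structural Protocol, anything with a matching complete method qualifies. That makes it cheap to test routing without touching the network. Here is a minimal sketch of a test double; StubClient and its fail flag are hypothetical names of mine, and LLMResponse is repeated only so the snippet runs on its own.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class LLMResponse:  # same shape as the dataclass above, repeated to run standalone
    text: str
    prompt_tokens: int
    completion_tokens: int
    model: str
    provider: str


class StubClient:
    """Hypothetical test double: satisfies LLMClient structurally, no network."""

    def __init__(self, provider: str, reply: str = "ok", fail: bool = False):
        self.provider = provider
        self.reply = reply
        self.fail = fail
        self.calls = 0  # how many times complete() was invoked

    async def complete(self, system, messages, max_tokens=1500, temperature=0.0):
        self.calls += 1
        if self.fail:
            raise RuntimeError(f"{self.provider} is down (simulated)")
        return LLMResponse(
            text=self.reply,
            prompt_tokens=0,
            completion_tokens=0,
            model="stub",
            provider=self.provider,
        )


async def demo() -> LLMResponse:
    client = StubClient("stub-primary", reply="hello")
    return await client.complete(
        system="You are terse.",
        messages=[{"role": "user", "content": "Say hello."}],
    )


result = asyncio.run(demo())
print(result.provider, result.text)
```

Setting fail=True on the stub gives you a deterministic "provider is down" case for unit tests.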

The Failover Logic

Don’t fail over on every error. Network blips happen. Retry first.

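import asyncio

import structlog  # assumption: the key-value log call below follows structlog's API

log = structlog.get_logger()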

class FailoverLLM:
    def __init__(self, primary: LLMClient, secondary: LLMClient):
        self.primary = primary
        self.secondary = secondary

    async def complete(self, **kwargs) -> LLMResponse:
        try:
            return await self._with_retries(self.primary, retries=2, **kwargs)
        except Exception as e:
            log.warning("primary_failed", error=str(e))
            return await self._with_retries(self.secondary, retries=1, **kwargs)

    async def _with_retries(self, client, retries, **kwargs):
        last_exc = None
        for attempt in range(retries + 1):
            try:
                return await client.complete(**kwargs)
            except Exception as e:
                last_exc = e
                if attempt < retries:
                    await asyncio.sleep(0.5 * (2 ** attempt))
        raise last_exc

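Two retries on primary, so transient 5xx errors and brief rate limiting have a chance to clear. Then up to two attempts on secondary (retries=1). Worst case is five calls plus about two seconds of backoff (0.5s + 1s on primary, 0.5s on secondary), so roughly 5x a normal call. Acceptable for an incident; not acceptable as steady-state.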
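One refinement worth considering: the loop above retries every exception, including ones a retry can never fix (a 400 from a malformed request, a 401 from a bad key). A minimal sketch of a more selective version follows. It assumes the raised exception carries a status_code attribute for HTTP errors; is_retryable, backoff_delays, and call_with_retries are hypothetical helpers, not part of either SDK.

```python
import asyncio


def is_retryable(exc: Exception) -> bool:
    """Decide whether retrying could plausibly help."""
    status = getattr(exc, "status_code", None)  # assumption: HTTP errors expose this
    if status is None:
        # No HTTP status: treat as a network-level failure (timeout, reset).
        # This also retries unrelated bugs, so narrow it if that bites you.
        return True
    return status in (408, 429) or status >= 500


def backoff_delays(retries: int, base: float = 0.5) -> list[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ..."""
    return [base * (2 ** attempt) for attempt in range(retries)]


async def call_with_retries(client, retries: int, base: float = 0.5, **kwargs):
    delays = backoff_delays(retries, base)
    for attempt in range(retries + 1):
        try:
            return await client.complete(**kwargs)
        except Exception as exc:
            # Give up immediately on the last attempt or a non-retryable error.
            if attempt == retries or not is_retryable(exc):
                raise
            await asyncio.sleep(delays[attempt])
```

With retries=2 and the default base, the sleep schedule is 0.5s then 1.0s, the same as the original loop.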

For steady-state, use a circuit breaker. After N failures on primary, route all traffic to secondary for M minutes, then probe primary again. Standard pattern; the circuitbreaker package is fine.
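If you would rather see the mechanics than pull in a dependency, the pattern is small. What follows is a hand-rolled sketch, not the circuitbreaker package's API. The class, the thresholds, and complete_with_breaker are all hypothetical, and the clock is injectable so the cooldown can be tested without waiting.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after max_failures consecutive failures, stays open for
    cooldown_s seconds, then lets probe requests through to primary."""

    def __init__(
        self,
        max_failures: int = 5,
        cooldown_s: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means closed (healthy)

    def allow_primary(self) -> bool:
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, allow a probe (half-open state).
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            # Open (or re-open after a failed probe) and restart the cooldown.
            self.opened_at = self.clock()


async def complete_with_breaker(breaker, primary, secondary, **kwargs):
    if breaker.allow_primary():
        try:
            resp = await primary.complete(**kwargs)
            breaker.record_success()
            return resp
        except Exception:
            breaker.record_failure()
    # Breaker is open, or primary just failed: serve from secondary.
    return await secondary.complete(**kwargs)
```

While the breaker is open, requests skip primary entirely, so you stop paying the retry latency on every call.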

What Doesn’t Port

The honest part of this story is what you can’t easily fail over.

Function calling / tool use. The schemas are different between providers. The semantics around parallel calls differ. If your app uses tools heavily, your failover path for tools is “degrade to no-tools”: answer the user with an apology, or skip the tool-augmented features.

Fine-tuned models. If you’ve fine-tuned a model on OpenAI, you have nothing to fail over to on Anthropic. Maintain a fine-tuned model on a second provider, or accept that fine-tuned workloads have no failover.

Embeddings. ada-002 embeddings are not compatible with Cohere’s, Voyage’s, or any other provider’s (Anthropic does not offer an embeddings model of its own). If your index is built on ada-002, you cannot suddenly query it with Voyage embeddings. Failover for embeddings means a second pre-built index using a second embedding model, which costs storage and rebuild time. Most teams accept that retrieval falls back to keyword-only during embedding outages.

Assistants API. Server-side state inside OpenAI is not portable. If you built on the Assistants API (see my earlier review), your failover is “we are offline.”

Vision and DALL-E. No drop-in equivalents across providers right now. Mitigate by graceful degradation in the UI (“image features temporarily unavailable”).
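One way to make this degradation explicit is a capability map that the request path consults before calling whichever provider is active. This is a sketch under assumptions: the Feature enum, the SUPPORTED table, and plan_request are hypothetical, and the table contents are placeholders for whatever your own integration has actually tested, not claims about any provider.

```python
from dataclasses import dataclass
from enum import Enum


class Feature(Enum):
    CHAT = "chat"
    TOOLS = "tools"
    VISION = "vision"
    EMBEDDINGS = "embeddings"


# Placeholder table: features your integration has tested per route.
SUPPORTED = {
    "primary": {Feature.CHAT, Feature.TOOLS, Feature.VISION, Feature.EMBEDDINGS},
    "secondary": {Feature.CHAT},
}

# User-facing notices for features that get dropped during failover.
DEGRADED_MESSAGES = {
    Feature.TOOLS: "Tool-assisted answers are temporarily unavailable.",
    Feature.VISION: "Image features temporarily unavailable.",
    Feature.EMBEDDINGS: "Semantic search is temporarily unavailable; using keyword search.",
}


@dataclass
class Plan:
    features: set  # features to actually use on this request
    notices: list  # messages to surface in the UI


def plan_request(route: str, wanted: set) -> Plan:
    supported = SUPPORTED[route]
    dropped = sorted(wanted - supported, key=lambda f: f.value)
    return Plan(
        features=wanted & supported,
        notices=[DEGRADED_MESSAGES[f] for f in dropped if f in DEGRADED_MESSAGES],
    )


plan = plan_request("secondary", {Feature.CHAT, Feature.TOOLS, Feature.VISION})
print(plan.features, plan.notices)
```

The point of the table is that the degraded behavior gets decided in advance, in code review, rather than improvised during the incident.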

The Multi-Region Question

Azure OpenAI is a separate runtime from OpenAI. Different deployment, different incident surface, different commercial terms. It runs the same models (with different release timing — Azure tends to lag by weeks).

If your primary commercial requirement is OpenAI models specifically, Azure OpenAI is a more honest failover target than Anthropic. The model behavior matches. The output format matches. The integration cost is lower. It is, however, the same vendor at a corporate level — protections against an OpenAI business problem are weaker.

The pattern I’m rolling out:

Tier 1: OpenAI direct (lowest latency, latest models)
Tier 2: Azure OpenAI (same models, lagging by ~2 weeks, different infrastructure)
Tier 3: Anthropic (different vendor, different model family, true diversity)

Failover walks down the tiers. Most incidents only affect Tier 1, and traffic falls back to Tier 2. Catastrophic incidents fall through to Tier 3.
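Walking the tiers is a small generalization of a two-client failover. Here is a minimal sketch, assuming each tier is any object with an async complete method; TieredLLM is a hypothetical name, and per-tier retries are left out to keep the control flow visible.

```python
import asyncio


class TieredLLM:
    """Try each tier in order; return the first successful response."""

    def __init__(self, tiers):
        # tiers: ordered list of (name, client) pairs, most preferred first
        if not tiers:
            raise ValueError("at least one tier is required")
        self.tiers = list(tiers)

    async def complete(self, **kwargs):
        errors = []
        for name, client in self.tiers:
            try:
                return await client.complete(**kwargs)
            except Exception as exc:
                # Remember why this tier failed, then fall through to the next.
                errors.append(f"{name}: {exc}")
        raise RuntimeError("all tiers failed: " + "; ".join(errors))
```

In practice each tier would be wrapped in its own retry and circuit-breaker logic, so a dead Tier 1 is skipped rather than re-tried on every request.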

What To Practice

A failover path you’ve never exercised does not work. I am setting up monthly game days:

Force a failure on primary (block the API endpoint at the network layer)
Watch traffic shift to secondary
Measure user-visible latency change
Verify logs, metrics, alerts all fire correctly
Verify the secondary’s rate limits can absorb the load

The first time you run this, you will find broken things. The second time, fewer. By the third or fourth, the failover is real.
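Blocking the endpoint at the network layer is the realistic test. For rehearsals in staging, an application-level fault injector is a cheaper complement, and it is a different technique from the network block described above. A minimal sketch follows; FaultInjectingClient and InjectedFault are hypothetical names.

```python
import random


class InjectedFault(RuntimeError):
    """Raised instead of calling the real provider during a game day."""


class FaultInjectingClient:
    """Wraps any client with a complete() method and fails a configurable
    fraction of calls before they reach the provider."""

    def __init__(self, inner, failure_rate: float = 0.0, rng=None):
        self.inner = inner
        self.failure_rate = failure_rate  # 0.0 = never fail, 1.0 = always fail
        self.rng = rng or random.Random()

    async def complete(self, **kwargs):
        # rng.random() is in [0.0, 1.0), so a rate of 1.0 fails every call
        # and a rate of 0.0 fails none.
        if self.rng.random() < self.failure_rate:
            raise InjectedFault("simulated provider outage")
        return await self.inner.complete(**kwargs)
```

Wrap the primary with failure_rate=1.0 to simulate a hard outage, or something like 0.3 to see how the failover behaves under partial degradation.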

Pair this with the eval pipeline from my evaluation post. Run the eval against both providers. If the secondary fails the eval by a wide margin, you have advance notice that failover means degraded quality, and you can plan for that explicitly.
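The comparison itself can be very plain. The sketch below is not the pipeline from that post; EvalCase, pass_rate, compare, and the 10-point max_gap threshold are hypothetical stand-ins to show the shape of the check.

```python
import asyncio
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    system: str
    messages: list
    check: Callable[[str], bool]  # returns True if the response text passes


async def pass_rate(client, cases) -> float:
    if not cases:
        raise ValueError("need at least one eval case")
    passed = 0
    for case in cases:
        resp = await client.complete(system=case.system, messages=case.messages)
        if case.check(resp.text):
            passed += 1
    return passed / len(cases)


async def compare(primary, secondary, cases, max_gap: float = 0.10) -> dict:
    # Run both providers over the same cases concurrently.
    p, s = await asyncio.gather(
        pass_rate(primary, cases), pass_rate(secondary, cases)
    )
    # "degraded" flags a secondary that trails primary by more than max_gap.
    return {"primary": p, "secondary": s, "degraded": (p - s) > max_gap}
```

Run it on a schedule, not just once: both providers ship model updates, so the gap moves over time.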

Common Pitfalls

Building a perfect abstraction. “Provider-agnostic LLM gateway” is a tempting project. It will take six months. Build the minimum viable thing this week instead.

Not exercising failover. Documented failover plans that nobody has run will fail when you need them. Run them.

Forgetting the supply chain. If you depend on a downstream LLM-powered service that depends on a single provider, your stack inherits that provider’s risk. Audit your transitive dependencies.

Ignoring rate limits on secondaries. If secondary capacity is sized for 0% of traffic, failing over will instantly rate-limit you. Provision secondary capacity at 30-50% of primary at least.

Failing over too aggressively. A 500 error from a single request is not a provider outage. Two retries, then circuit-breaker logic. Otherwise you’ll flap and make incidents worse.

Treating failover as the whole story. Failover handles availability. It doesn’t handle a price hike or a model deprecation. Those need contract and procurement work, not engineering.

What’s Next

In December I want to write about cost governance for LLM apps: budgets, alerts, per-tenant accounting, and how to keep a runaway query from generating a $10K surprise. It’s the operational topic I’ve been asked about most this month.

Wrapping Up

November forced a conversation that engineering teams had been postponing. Vendor risk in LLM platforms is real, the cost of mitigating it before you need it is small, and the cost during an incident is enormous. Build the abstraction. Wire up the secondary. Run the game day. Then go back to shipping features.

We don’t get to choose when the next provider event happens. We do get to choose whether we’re ready when it does.

