Claude 2.1 vs GPT-4 Turbo, A Side-by-Side at 100K Context

November 24, 2023 · 6 min read · by Muhammad Amal ai

TL;DR:

Claude 2.1 (200K context) and GPT-4 Turbo (128K context) are the first credible long-context production options.
Long-context recall is real but uneven on both: the middle of long contexts is still weaker than the ends.
For document QA at scale, Claude 2.1’s pricing and reduced hallucinations make it competitive; for tools and structured output, GPT-4 Turbo still leads.

Anthropic released Claude 2.1 on Tuesday with a 200K-token context window — the largest available — and a claimed 2x reduction in hallucination rate over Claude 2.0. The timing is sharp: OpenAI’s chaos last week made every engineering team I know reopen the vendor risk conversation, and Anthropic is the obvious second name on the list.

I spent the last two days running both models against the same workload — our internal document QA bot, plus a few synthetic long-context retrieval tests. This is what I found, and what I’d actually use where.

The Setup

For each test I used:

Claude 2.1 via the Anthropic Messages API
GPT-4 Turbo (gpt-4-1106-preview) via the OpenAI Chat Completions API
Same retrieval pipeline (the hybrid Postgres setup from my pgvector post)
Same prompts, adapted minimally for each provider’s preferred format
The 50-question golden set from my evaluation post

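# anthropic==0.7.7
from anthropic import Anthropic

# The client reads ANTHROPIC_API_KEY from the environment.
client = Anthropic()

# build_user_message, question, and retrieved_chunks are defined elsewhere in the QA bot (not shown).
response = client.messages.create(
    model="claude-2.1",
    max_tokens=1500,
    # The system prompt is its own parameter, not a message with role "system".
    system="You are an assistant answering questions about internal documentation. Cite sources.",
    messages=[
        {"role": "user", "content": build_user_message(question, retrieved_chunks)},
    ],
)
# The reply is a list of content blocks; the first block holds the text.
answer = response.content[0].text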

The migration from OpenAI’s API to Anthropic’s took about an hour, mostly to handle the slightly different message format (system prompt is a separate parameter, content blocks are a list).
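The message-format difference can be sketched as a small adapter. This is a minimal illustration, assuming OpenAI-style message dicts as input; `to_anthropic_format` is a hypothetical helper name and is not part of either SDK.

```python
def to_anthropic_format(openai_messages):
    """Split OpenAI-style messages into (system, messages).

    Hypothetical helper: system messages are pulled out into one string,
    and the remaining turns keep their role and content unchanged.
    """
    system_parts = []
    turns = []
    for message in openai_messages:
        if message["role"] == "system":
            system_parts.append(message["content"])
        else:
            turns.append({"role": message["role"], "content": message["content"]})
    return "\n\n".join(system_parts), turns


system, turns = to_anthropic_format([
    {"role": "system", "content": "Cite sources."},
    {"role": "user", "content": "What is our retention policy?"},
])
```

The returned `system` string and `turns` list map onto the `system` and `messages` arguments of the call shown earlier.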

Long-Context Recall: The “Needle” Test

Both providers published needle-in-a-haystack benchmarks. I ran my own variant: insert a known sentence at controlled depths in a 100K-token document and ask the model to retrieve it.
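A rough sketch of how a needle can be placed at a controlled depth follows. It is not the exact harness used for these tests: `insert_needle` is a hypothetical helper, and it measures depth in whitespace-separated words as a cheap stand-in for tokens, which is an assumption.

```python
def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at `depth` (0.0 = start, 1.0 = end) of `haystack`.

    Depth is measured in words, not tokens, so the position is approximate.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    words = haystack.split()
    position = round(depth * len(words))
    return " ".join(words[:position] + [needle] + words[position:])


# Build one prompt per depth condition from the same base document.
document = "alpha beta gamma delta"
variants = {d: insert_needle(document, "NEEDLE", d) for d in (0.0, 0.5, 1.0)}
```

For a stricter test, the depth would be computed with the provider’s tokenizer and the needle placed on a sentence boundary.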

What I saw, in rough numbers across 20 trials per condition:

Needle depth	Claude 2.1 recall	GPT-4 Turbo recall
0-10%	100%	100%
40-60%	65%	55%
90-100%	100%	100%

Both models recall better at the start and end of long contexts than in the middle. This matches the “lost in the middle” findings from earlier this year and the public benchmarks. Claude 2.1 is somewhat better in the middle, but neither is reliable enough that I’d skip retrieval and stuff a 100K document into the prompt.

Claude 2.1’s released benchmarks reportedly show 30% recall in the middle when the inserted fact contradicts the surrounding document’s tone. I didn’t reproduce that test cleanly, but the broad pattern holds: long context is not a substitute for retrieval. It’s a relief valve.

Document QA on the Golden Set

On the same 50-question internal QA set, both models scored similarly on retrieval-grounded answers. The differences were qualitative.

Claude 2.1’s hallucination reduction is noticeable. When the retrieved context doesn’t contain the answer, Claude 2.1 is more likely to say “the provided context does not specify” than GPT-4 Turbo, which sometimes confabulates plausibly. For an internal tool where wrong answers are worse than no answers, this is a meaningful behavior change.

GPT-4 Turbo writes more confident, more compact responses on average. Claude 2.1’s responses are slightly more verbose and include more hedging. For end-user comfort this is mixed — engineers preferred GPT-4 Turbo’s tone, support reps preferred Claude 2.1’s caution.

On accuracy against the reference answers, both landed in the 0.80-0.85 range on LLM-judge correctness. Statistically indistinguishable on a 50-question set. I’d want 500+ before I claimed one was better on this dimension.
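A back-of-the-envelope check shows why 50 questions is too few. The sketch below treats each score as an independent pass/fail proportion and applies a two-proportion z-test. That is a simplifying assumption: the judge scores may not be binary, and a paired test on the same questions would be tighter.

```python
import math


def two_proportion_z(p_a: float, p_b: float, n: int) -> float:
    """z statistic for the gap between two proportions, each measured on n items."""
    standard_error = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
    return abs(p_a - p_b) / standard_error


# Compare the two ends of the observed range at two sample sizes.
z_at_50 = two_proportion_z(0.80, 0.85, 50)
z_at_500 = two_proportion_z(0.80, 0.85, 500)
print(f"n=50:  z = {z_at_50:.2f}")
print(f"n=500: z = {z_at_500:.2f}")
```

Under these assumptions z is roughly 0.66 at 50 questions and 2.09 at 500, against the usual 1.96 threshold for 95% confidence. A five-point gap only becomes distinguishable at around 500 questions.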

Function Calling and Structured Output

Claude 2.1 added tool use in this release, but it’s behind GPT-4 Turbo’s function calling in two respects. First, it doesn’t have parallel function calling: Anthropic’s tool use is one tool per turn. Second, Anthropic doesn’t have a JSON mode equivalent: you can ask for JSON in the prompt, but you don’t get a guarantee.
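Without a guarantee, the caller has to validate the reply. One defensive pattern is sketched below, using only the standard library. `parse_json_reply` is a hypothetical helper name; it pulls the first parseable JSON object out of a reply that may carry extra prose around it.

```python
import json


def parse_json_reply(text: str):
    """Return the first JSON object found in `text`, or None if nothing parses."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            # raw_decode parses one value starting at `start` and ignores trailing text.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        return obj
    return None


reply = 'Sure, here it is: {"answer": "42", "sources": ["a.md"]} Hope that helps.'
parsed = parse_json_reply(reply)
```

A `None` result is the signal to retry the request or fall back, rather than passing malformed output downstream.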

For workloads where the LLM is calling tools, GPT-4 Turbo is still the easier choice. The OpenAI SDK’s function calling has matured over multiple releases and the tool use docs reflect that.

For workloads that are pure document QA — read context, answer question, return text — Claude 2.1 is a fair competitor.

Pricing and Cost

Claude 2.1: $0.008/1K input, $0.024/1K output. GPT-4 Turbo: $0.01/1K input, $0.03/1K output.

Claude 2.1 is meaningfully cheaper at the same context size — about 20% on input and 20% on output. For a high-volume document QA workload, that’s a real number.

For the chatbot in question, current daily spend on GPT-4 Turbo is about $14. On Claude 2.1, modeling the same traffic, it’d be around $11. Not transformative. Worth knowing.
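The per-request arithmetic behind those numbers can be sketched as below, using the list prices quoted above. The 6,000-input, 500-output request is an illustrative assumption, not a measured average from the bot.

```python
# USD per 1K tokens, as quoted in this section.
PRICES_PER_1K = {
    "claude-2.1": {"input": 0.008, "output": 0.024},
    "gpt-4-turbo": {"input": 0.01, "output": 0.03},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at list price."""
    price = PRICES_PER_1K[model]
    return (input_tokens / 1000 * price["input"]
            + output_tokens / 1000 * price["output"])


# Hypothetical request shape: 6,000 tokens in, 500 tokens out.
claude_cost = request_cost("claude-2.1", 6000, 500)
gpt_cost = request_cost("gpt-4-turbo", 6000, 500)
ratio = claude_cost / gpt_cost
```

Because both the input and output prices are 20% lower, the ratio is 0.8 for any token mix, which is why $14 a day becomes roughly $11.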

Latency

Claude 2.1 is notably slower in my testing. Time-to-first-token at 8K context was around 800ms for Claude 2.1 versus 600ms for GPT-4 Turbo. End-to-end for a 500-token response was about 4.5 seconds versus 3.2. The two providers run on different infrastructure at different stages of optimization, so this gap will likely shift.
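A provider-agnostic way to take these two measurements is sketched below. It assumes the streamed response has already been wrapped as a lazy iterable of text chunks, so that the request is sent when iteration begins. `measure_stream` is a hypothetical helper, not an SDK function.

```python
import time


def measure_stream(chunks):
    """Consume an iterable of text chunks.

    Returns (time_to_first_chunk_seconds, total_seconds, full_text).
    """
    start = time.perf_counter()
    first_chunk_at = None
    parts = []
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    return first_chunk_at, total, "".join(parts)


def fake_stream():
    """Stand-in for a provider stream, with an artificial delay before the first chunk."""
    time.sleep(0.01)
    yield "Hel"
    yield "lo"


ttft, total, text = measure_stream(fake_stream())
```

Single runs are noisy, so averaging over repeated requests at a fixed context size gives a fairer comparison.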

If your application is latency-sensitive, GPT-4 Turbo is faster today. If you’re processing offline or async (overnight document processing, scheduled report generation), the latency difference doesn’t matter.

What I’d Actually Use Where

After all this:

Interactive chat with tools (the main internal bot, anything with function calls): GPT-4 Turbo stays. It has function calling with parallel tool use, it’s faster, and structured output is easier to get.

Long-document analysis (legal review tool, contract summarization, large-codebase Q&A): Claude 2.1. The 200K context, the reduced hallucinations, the lower price. Where you’d have chunked into 5 passes with GPT-4, you can often do one pass with Claude.

Async batch processing: Whichever is cheaper for the specific token mix. Often Claude 2.1 wins on the input-heavy side.

As a fallback for the other: Strongly recommend. After last weekend, I have both wired up everywhere, with feature flags to swap.
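The failover wiring can be sketched roughly as below. This is a minimal sketch, not the production setup: `answer_with_failover` and the provider callables are hypothetical, and the `primary` argument stands in for whatever feature-flag system picks the preferred vendor.

```python
def answer_with_failover(question, providers, primary):
    """Try `primary` first, then every other provider in insertion order.

    `providers` maps a name to a callable that takes the question and
    returns answer text. Returns (provider_name, answer).
    """
    order = [primary] + [name for name in providers if name != primary]
    errors = {}
    for name in order:
        try:
            return name, providers[name](question)
        except Exception as exc:  # real code should catch provider-specific errors
            errors[name] = exc
    raise RuntimeError(f"all providers failed: {errors}")


def openai_down(question):
    raise TimeoutError("simulated outage")


def anthropic_up(question):
    return "stub answer"


used, answer = answer_with_failover(
    "What is our retention policy?",
    {"openai": openai_down, "anthropic": anthropic_up},
    primary="openai",
)
```

Returning the provider name alongside the answer lets logs and dashboards show which vendor actually served each request.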

Common Pitfalls

Prompt portability is imperfect. Prompts tuned for GPT-4 Turbo don’t always work as well on Claude 2.1 and vice versa. Anthropic’s models respond better to XML-tagged structure in prompts. OpenAI’s respond better to clear bullet-pointed instructions. Re-tune when you migrate.

SDK differences. OpenAI’s response.usage and Anthropic’s response.usage have different fields. Your observability needs adapters.
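A usage adapter can be as small as the sketch below. It assumes the usage objects have already been converted to plain dicts, with OpenAI reporting `prompt_tokens` and `completion_tokens` and Anthropic reporting `input_tokens` and `output_tokens`. `normalize_usage` is a hypothetical name.

```python
def normalize_usage(provider: str, usage: dict) -> dict:
    """Map provider-specific usage fields onto one shared schema."""
    if provider == "openai":
        return {
            "input_tokens": usage["prompt_tokens"],
            "output_tokens": usage["completion_tokens"],
        }
    if provider == "anthropic":
        return {
            "input_tokens": usage["input_tokens"],
            "output_tokens": usage["output_tokens"],
        }
    raise ValueError(f"unknown provider: {provider}")


openai_usage = normalize_usage(
    "openai", {"prompt_tokens": 1200, "completion_tokens": 300, "total_tokens": 1500}
)
anthropic_usage = normalize_usage(
    "anthropic", {"input_tokens": 1200, "output_tokens": 300}
)
```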

Rate limits. Both providers have tiered rate limits and you’ll hit them faster than you expect at GPT-4 / Claude-tier traffic. Plan capacity.

Streaming semantics. Both support streaming, but the chunk shapes are different. If you have a UI that consumes one provider’s stream, the other will need a small adapter.
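The stream adapter can be sketched as one function that extracts text from either chunk shape. The dict shapes below are simplified versions of each provider’s streamed events as I understand them, so treat them as assumptions and check the current API references. `text_from_chunk` is a hypothetical name.

```python
def text_from_chunk(provider: str, chunk: dict) -> str:
    """Return the text carried by one streamed chunk, or "" if it carries none."""
    if provider == "openai":
        # Assumed shape: {"choices": [{"delta": {"content": "..."}}]}
        choices = chunk.get("choices") or []
        if not choices:
            return ""
        return choices[0].get("delta", {}).get("content") or ""
    if provider == "anthropic":
        # Assumed shape: {"type": "content_block_delta", "delta": {"text": "..."}}
        if chunk.get("type") == "content_block_delta":
            return chunk.get("delta", {}).get("text", "")
        return ""
    raise ValueError(f"unknown provider: {provider}")


openai_text = text_from_chunk("openai", {"choices": [{"delta": {"content": "Hi"}}]})
anthropic_text = text_from_chunk(
    "anthropic", {"type": "content_block_delta", "delta": {"type": "text_delta", "text": "Hi"}}
)
```

With this in place the UI consumes plain strings and never sees a provider-specific event.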

Long-context cost discipline. With 200K context available, it’s tempting to skip retrieval. Don’t. Cost and latency scale roughly linearly with input tokens. A 150K context query on Claude 2.1 is $1.20 in input tokens alone. Run it a thousand times a day and you’re paying about $1,200 a day, roughly $36K a month or $438K a year, for retrieval you didn’t do.
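To make that arithmetic concrete, here is a small sketch using the Claude 2.1 input price quoted in the pricing section. The 4,000-token retrieved prompt is a hypothetical comparison point, not a measured figure from this workload.

```python
CLAUDE_21_INPUT_PER_1K = 0.008  # USD per 1K input tokens, from the pricing section


def input_cost(tokens_per_query: int, queries_per_day: int = 1, days: int = 1) -> float:
    """Input-token cost in USD for a given query size, daily volume, and period."""
    return tokens_per_query / 1000 * CLAUDE_21_INPUT_PER_1K * queries_per_day * days


scenarios = {
    "full 150K context": 150_000,
    "retrieved context (hypothetical 4K)": 4_000,
}

for label, tokens in scenarios.items():
    per_query = input_cost(tokens)
    per_month = input_cost(tokens, queries_per_day=1000, days=30)
    per_year = input_cost(tokens, queries_per_day=1000, days=365)
    print(f"{label}: ${per_query:.3f}/query, ${per_month:,.0f}/month, ${per_year:,.0f}/year")
```

At a thousand queries a day, the full-context path costs about $1,200 a day in input tokens, which is roughly $36K over a 30-day month.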

What’s Next

The interesting question is what happens to retrieval-augmented generation when context windows reach 1M+ tokens and recall improves. Probably not what most people think — retrieval doesn’t go away, but the unit of retrieval gets bigger and the synthesis layer does more work. I’ll write about that hypothesis in more depth when I’ve had time to test it.

Wrapping Up

Claude 2.1 is a credible production option and the first one that gives serious competition to GPT-4 Turbo across the long-context use cases. Use both. Wire them up as failovers for each other. Re-evaluate quarterly because the ground keeps shifting.

The good news for builders this month: you have choices. Six months ago you didn’t.
