Migrating to GPT-4 Turbo: What 128K Context Actually Changes
November 8, 2023 · 6 min read · by Muhammad Amal

TL;DR: GPT-4 Turbo’s 128K context is real, but you should not treat it as a license to skip retrieval. The pricing drop ($0.01 input / $0.03 output per 1K tokens) makes GPT-4-class quality affordable for internal tools that were stuck on gpt-3.5. Latency at large context sizes is the surprise: it scales worse than you expect.

OpenAI DevDay was two days ago and the dust hasn’t settled. The headline is GPT-4 Turbo with 128K context at a third of the old GPT-4 input price and half the output price. Alongside it came the Assistants API, JSON mode, parallel function calling, and Vision in the chat completions API. A lot to absorb at once.

I spent the last 48 hours pointing our internal chatbot — the one from my previous post on LlamaIndex RAG — at gpt-4-1106-preview and measuring everything. Here’s what changed, what didn’t, and what to actually do about it.

The Cost Math, Honestly

GPT-3.5-turbo-16k was $0.003/1K input, $0.004/1K output. GPT-4-Turbo is $0.01/$0.03. So nominally we got more expensive: roughly 3.3x on input and 7.5x on output. But that’s the wrong comparison.

The right comparison is GPT-4 (the original, 8K context) versus GPT-4 Turbo. Original GPT-4 was $0.03/$0.06. Turbo is $0.01/$0.03. That’s a 3x cut on input and 2x cut on output, and you get 16x the context. For workloads that were already on GPT-4, this is the largest practical price drop OpenAI has shipped.

For our internal chatbot, the per-question cost went from about $0.004 (gpt-3.5-16k) to $0.014 (gpt-4-turbo). For a tool serving maybe 800 questions a day across the engineering org, that’s $11/day versus $3/day. The qualitative improvement in answer quality made that an easy sell.
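The arithmetic in this section is easy to re-run against your own traffic. Here is a minimal sketch that uses only the prices and volumes quoted in this post; the function names are mine and purely illustrative.

```python
# Prices in USD per 1K tokens, as quoted in this post (November 2023).
PRICES = {
    "gpt-3.5-turbo-16k": {"input": 0.003, "output": 0.004},
    "gpt-4": {"input": 0.03, "output": 0.06},
    "gpt-4-turbo": {"input": 0.01, "output": 0.03},
}


def price_ratio(model_a: str, model_b: str) -> dict:
    """How many times more expensive model_a is than model_b, per direction."""
    a, b = PRICES[model_a], PRICES[model_b]
    return {kind: a[kind] / b[kind] for kind in ("input", "output")}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single request with the given token counts."""
    p = PRICES[model]
    return input_tokens / 1000 * p["input"] + output_tokens / 1000 * p["output"]


def daily_cost(cost_per_question: float, questions_per_day: int) -> float:
    """Daily spend in USD for a fixed per-question cost."""
    return cost_per_question * questions_per_day
```

Calling `price_ratio("gpt-4-turbo", "gpt-3.5-turbo-16k")` reproduces the 3.3x and 7.5x figures, and `daily_cost(0.014, 800)` gives the roughly $11/day number. To run it against real traffic, feed `request_cost` the token counts from your own logs.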

The lesson: if you were on GPT-4 for quality reasons, migrate today. If you were on GPT-3.5 for cost reasons, run the math against your real traffic — the gap closed considerably but it didn’t disappear.

The 128K Context Trap

Here’s where I want to push back on the prevailing narrative. “128K context means you don’t need RAG anymore.” I’ve seen this take at least twenty times this week and it’s wrong.

Yes, you can stuff a 100K-token corpus into a single prompt. Should you? Three reasons not to:

Cost. 100K input tokens is $1.00 per query at GPT-4 Turbo input pricing. Per query. Run that against a chatbot with even modest traffic and you’re spending real money to do retrieval badly.

Latency. This was the surprise. At 4K context, gpt-4-turbo time-to-first-token sits around 600ms in my testing. At 80K context, it’s closer to 4 seconds. A bigger window doesn’t make the model any faster: the whole prompt still has to be processed before the first token comes back.

Recall degradation. Anthropic, Stanford, and others have shown the lost-in-the-middle effect persists at large contexts. Information in the middle of a long context is recalled worse than information at the ends. Stuffing 100K tokens does not mean the model sees all 100K tokens equally.
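If you want to check the latency point against your own prompts, time-to-first-token is simple to measure with a streaming request. This is a sketch of one way to do it, not the exact harness behind the numbers above. `time_to_first_chunk` is a hypothetical helper name, and it treats the first streamed chunk as a proxy for the first token.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def time_to_first_chunk(
    start_stream: Callable[[], Iterable],
) -> Tuple[Optional[float], float]:
    """Start a stream and consume it fully.

    Returns (seconds until the first chunk, total seconds). The first value
    is None if the stream yields nothing. The clock starts before
    start_stream() is called, so request setup time is included.
    """
    start = time.perf_counter()
    first = None
    for _ in start_stream():
        if first is None:
            first = time.perf_counter() - start
    total = time.perf_counter() - start
    return first, total
```

With the openai 1.x client you would pass something like `lambda: client.chat.completions.create(model="gpt-4-1106-preview", messages=messages, stream=True)`, then repeat at a few prompt sizes and compare the first value.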

So what is 128K context actually good for in a RAG system? Two things. First, you can stop being so aggressive about top-k truncation — pass 15-20 retrieved chunks instead of 4-5 and let the model sift. Second, conversation history. With 16K you had to summarize aggressively. With 128K you can keep a long thread intact.
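Keeping a long thread intact still calls for some budget, just a much larger one. A rough sketch of the idea: keep the system messages, then keep as many of the newest turns as fit. The helper name is hypothetical, the token counter is supplied by the caller (for example a wrapper around a tokenizer), and per-message overhead is ignored. It also assumes system messages belong at the front of the conversation.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def trim_history(
    messages: List[Message],
    budget_tokens: int,
    count_tokens: Callable[[str], int],
) -> List[Message]:
    """Keep system messages plus the newest turns that fit in budget_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    used = sum(count_tokens(m["content"]) for m in system)
    kept: List[Message] = []
    for m in reversed(turns):  # walk from newest to oldest
        cost = count_tokens(m["content"])
        if used + cost > budget_tokens:
            break  # everything older than this is dropped
        kept.append(m)
        used += cost
    return system + list(reversed(kept))
```

With a 16K window the budget forces this to drop turns early, which is where the aggressive summarizing came from. With 128K the same function rarely drops anything for a typical internal-tool conversation.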

The Migration Mechanics

The change in code was trivial. The model name changes, and you raise the context window setting in LlamaIndex.

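# openai==1.3.5, llama-index==0.8.68
from llama_index import ServiceContext
from llama_index.embeddings import OpenAIEmbedding
from llama_index.llms import OpenAI

llm = OpenAI(
    model="gpt-4-1106-preview",
    temperature=0.0,
    max_tokens=1500,  # cap on generated tokens per answer
)

# Assumption: the embedding model from the original setup isn't shown here.
# OpenAIEmbedding() is a stand-in; keep whatever your index was built with.
embed_model = OpenAIEmbedding()

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    context_window=128000,  # prompt budget LlamaIndex plans against
    num_output=1500,  # tokens reserved for the answer; matches max_tokens
)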

context_window matters because LlamaIndex uses it to budget the prompt — if it thinks you have 16K it’ll truncate aggressively, and you’ll wonder why your fancy new 128K model is ignoring half the retrieved chunks.

For top-k, I went from 5 to 12 and the answer quality jumped. Past 12 the curve flattened. Your mileage will vary by corpus density.
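Finding where the curve flattens is worth doing on your own corpus instead of copying my number. A small sketch, assuming you already have some quality score per top-k value from your own evaluation set; the helper name and the `min_gain` threshold are illustrative choices, not anything LlamaIndex provides.

```python
from typing import Dict


def smallest_sufficient_k(scores: Dict[int, float], min_gain: float) -> int:
    """Return the smallest k after which the next step up gains < min_gain.

    scores maps each top-k value you tried to a quality score
    (higher is better).
    """
    ks = sorted(scores)
    if not ks:
        raise ValueError("scores is empty")
    for current, nxt in zip(ks, ks[1:]):
        if scores[nxt] - scores[current] < min_gain:
            return current
    return ks[-1]  # still improving at the largest k tried
```

This stops at the first flat step, so a noisy evaluation can trigger it early. Averaging scores over several runs before calling it is a reasonable precaution.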

JSON Mode and Parallel Function Calling

These two features are the most underrated things from DevDay. JSON mode guarantees the response is syntactically valid JSON, as long as it isn’t cut off by the token limit:

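from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
user_query = "What's the on-call rotation for payments?"  # example question

response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    response_format={"type": "json_object"},  # turns on JSON mode
    messages=[
        {"role": "system", "content": "You output only valid JSON with keys: answer, sources, confidence."},
        {"role": "user", "content": user_query},
    ],
)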

Before this, I had a regex-and-prayer JSON repair function that ran on every response. Gone now. Note that the word JSON still has to appear somewhere in the messages (the system prompt is the natural place) or the API errors out. That’s documented but easy to miss.
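One caveat: JSON mode constrains the syntax, not the schema. The model can still omit a key you asked for, and a reply that stops with `finish_reason == "length"` can be cut off mid-object. A minimal sketch of the check I would keep in place of the old repair function; `parse_answer` and `REQUIRED_KEYS` are hypothetical names, and the key set comes from the system prompt above.

```python
import json

REQUIRED_KEYS = {"answer", "sources", "confidence"}


def parse_answer(content: str, finish_reason: str = "stop") -> dict:
    """Parse a JSON-mode reply and check it has the keys the prompt asked for."""
    if finish_reason == "length":
        # Truncated output can end mid-object, so don't try to parse it.
        raise ValueError("response hit the token limit; JSON may be incomplete")
    data = json.loads(content)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data
```

In use, that would be `parse_answer(response.choices[0].message.content, response.choices[0].finish_reason)`.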

Parallel function calling means the model can request multiple tool invocations in a single response. For our chatbot this matters when someone asks a compound question: “what’s the on-call rotation for payments AND what’s the runbook for the refund queue?” Previously this was two round trips. Now it’s one. Latency improvement is real.
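On the client side, the change is that one assistant message can now carry several tool calls, and each needs its own reply keyed by `tool_call_id`. A sketch of the dispatch loop, assuming tool calls shaped like the openai 1.x objects (`.id`, `.function.name`, `.function.arguments` as a JSON string). The registry and any tool names you put in it are hypothetical.

```python
import json
from typing import Any, Callable, Dict, Iterable, List


def run_tool_calls(
    tool_calls: Iterable[Any],
    registry: Dict[str, Callable[..., Any]],
) -> List[dict]:
    """Execute every tool call from one assistant message.

    Returns one {"role": "tool", ...} message per call, in order, ready to
    append to the conversation after the assistant message.
    """
    results = []
    for call in tool_calls:
        fn = registry[call.function.name]  # KeyError if the tool is unknown
        kwargs = json.loads(call.function.arguments)  # arguments are a JSON string
        results.append(
            {
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(fn(**kwargs)),
            }
        )
    return results
```

This runs the tools one after another. The saving described above comes from needing one model round trip instead of two; if the tools themselves are slow, running them in a thread pool is a natural next step.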

Common Pitfalls

Don’t blindly upgrade your prompts. Prompts tuned for gpt-3.5 often have verbose instructions that gpt-4 doesn’t need. Strip them down. I cut my system prompt from 800 tokens to 280 and quality improved because the model wasn’t trying to satisfy redundant constraints.

Watch the rate limits. GPT-4 Turbo’s tokens-per-minute (TPM) limits at launch are lower than gpt-3.5-turbo’s. Check your tier in the OpenAI rate limits dashboard before you migrate a high-traffic workload.

The model is still in preview. gpt-4-1106-preview is exactly that — a preview. Production guarantees aren’t there yet. For genuinely critical paths I’d keep a fallback to gpt-4 (the stable one) wired in.
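A fallback can be as small as a loop over model names. The sketch below is one way to wire it, with a hypothetical `with_fallback` helper; `call` stands in for whatever function makes your request for a given model. Catching bare `Exception` is an assumption made for brevity, and in real code you would narrow it to your client’s error types.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def with_fallback(call: Callable[[str], T], models: Sequence[str]) -> T:
    """Try call(model) for each model in order; return the first success."""
    last_error = None
    for model in models:
        try:
            return call(model)
        except Exception as exc:  # assumption: narrow this in real code
            last_error = exc
    raise RuntimeError(f"all models failed: {list(models)}") from last_error
```

Usage would look like `with_fallback(ask, ["gpt-4-1106-preview", "gpt-4"])`. Keep in mind that the original gpt-4 has an 8K window, so the fallback path also has to shrink the prompt, for example by dropping back to a smaller top-k.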

Refusal behavior is different. GPT-4 Turbo seems more willing to refuse or hedge on edge cases that gpt-3.5 would just answer. For internal tools this is usually a feature. For some user-facing products it’ll trip you up.

Output is capped at 4096 tokens regardless of context size. This caught me. 128K in, 4K out, full stop. If you need long outputs you still need to chain.
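Chaining here can be a simple loop: when a reply stops with `finish_reason == "length"`, append it to the conversation and ask for the rest. A sketch under stated assumptions: `complete` is a hypothetical wrapper around your chat call that returns `(text, finish_reason)`, and the continuation prompt wording is just one option.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]


def generate_long(
    complete: Callable[[List[Message]], Tuple[str, str]],
    messages: List[Message],
    max_rounds: int = 5,
) -> str:
    """Call complete() repeatedly while the reply is cut off by the output cap.

    complete(messages) must return (text, finish_reason).
    """
    history = list(messages)  # don't mutate the caller's list
    parts = []
    for _ in range(max_rounds):
        text, finish_reason = complete(history)
        parts.append(text)
        if finish_reason != "length":
            break
        history.append({"role": "assistant", "content": text})
        history.append(
            {"role": "user", "content": "Continue exactly where you left off."}
        )
    return "".join(parts)
```

The seams are not always clean, since the model may repeat or skip a few words at each join. For structured outputs it is usually more reliable to split the task into sections up front.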

When To Stay On 3.5

Not every workload needs GPT-4 Turbo. If you’re doing classification, simple extraction, or routing decisions, gpt-3.5-turbo-1106 (also updated at DevDay) is faster and cheaper and the quality is plenty. I kept the metadata-extraction step in our ingest pipeline on gpt-3.5 — it’s a structured task and the model upgrade wasn’t worth 5x the cost.

The pattern I’m settling on: small fast model for routing and extraction, big model for the user-facing synthesis. Mixture of model tiers. The Assistants API hints at this becoming a more standard pattern.
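In code, the tiering can start as nothing more than a lookup table. A minimal sketch, using the model names from this post; the task labels and the choice to default unknown tasks to the larger model are my assumptions.

```python
# Task labels are illustrative; model names are the ones discussed in this post.
MODEL_BY_TASK = {
    "routing": "gpt-3.5-turbo-1106",
    "classification": "gpt-3.5-turbo-1106",
    "extraction": "gpt-3.5-turbo-1106",
    "synthesis": "gpt-4-1106-preview",
}


def pick_model(task: str) -> str:
    """Return the model tier for a task; unknown tasks get the big model."""
    return MODEL_BY_TASK.get(task, "gpt-4-1106-preview")
```

Defaulting to the big model trades cost for safety. If cost matters more than quality on unclassified work, flip the default.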

What’s Next

I want to get hands-on with the Assistants API for stateful multi-turn use cases. The threading model is interesting but it locks you into OpenAI infrastructure more than the chat completions API does. Worth experimenting; not yet worth committing.

Next week I’ll dig into the Assistants API for a small internal tool that benefits from server-side state — and where it falls short for anything with audit or compliance requirements.

Wrapping Up

GPT-4 Turbo is the most significant model release of the year if you’re building production LLM systems. The price drop matters. The context window matters less than the marketing suggests but it’s still useful. The new capabilities (Vision, JSON mode, parallel tools) are where the surprise productivity wins are.

Migrate. Re-tune your prompts. Watch your latency. And don’t throw away your retriever.