The OpenAI Assistants API in Production, A Cautious Take

November 10, 2023 · 6 min read · by Muhammad Amal ai

TL;DR: The Assistants API solves the stateful conversation plumbing that everyone has been writing badly for a year. It also locks state inside OpenAI’s infrastructure, which makes audit, compliance, and portability harder. Use it for prototypes and internal tools; think twice before betting a regulated workload on it.

The Assistants API dropped Monday at DevDay and I’ve spent the week porting a small internal tool — an HR policy lookup bot — to it. The experience clarified what this API is for and where it falls apart.

Short version: OpenAI built an opinionated framework. If your app fits the opinions, you get a lot of plumbing for free. If it doesn’t, you’ll fight the abstraction. Most apps are somewhere in between.

What the API Actually Gives You

Three things, mainly. Threads (persistent conversation state), built-in tools (Code Interpreter, Retrieval), and an orchestration loop (Runs).

The thread is the most important piece. Before Assistants, every conversation needed your own database table mapping user_id to a message history, plus logic to summarize when context overflowed. You probably wrote it three times. Now OpenAI holds it.

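# openai==1.3.5
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the policy document first so there is a file ID to attach.
# "hr_policy.pdf" is a placeholder path, not a file from the original tool.
with open("hr_policy.pdf", "rb") as f:
    uploaded_file_id = client.files.create(file=f, purpose="assistants").id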

assistant = client.beta.assistants.create(
    name="HR Policy Assistant",
    instructions="You answer questions about company HR policy. Cite the source document.",
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}],
    file_ids=[uploaded_file_id],
)

thread = client.beta.threads.create()

client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="How many sick days do I get in the first year?",
)

run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
)

That’s the happy path. It’s clean. There’s no token budget management, no message truncation, no retriever to wire up. You upload PDFs and the built-in Retrieval tool handles chunking and vector search.
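
The snippet above starts a run but never reads the answer. Once the run reaches completed, the reply is a new message on the thread. A minimal sketch of pulling it out, assuming the message list is newest-first (the default order of client.beta.threads.messages.list) and that the reply is made of text parts; latest_assistant_text is a hypothetical helper name, and the fake message stands in for a real API response:

```python
from types import SimpleNamespace


def latest_assistant_text(messages) -> str:
    """Return the text of the newest assistant message, or "" if none exists.

    `messages` is assumed to be ordered newest-first.
    """
    for message in messages:
        if message.role != "assistant":
            continue
        # A message holds a list of content parts; keep only the text ones.
        parts = [p.text.value for p in message.content if p.type == "text"]
        return "\n".join(parts)
    return ""


# Fake message with the same shape, so the helper can be tried offline.
fake_reply = SimpleNamespace(
    role="assistant",
    content=[SimpleNamespace(type="text", text=SimpleNamespace(value="Ten days."))],
)
print(latest_assistant_text([fake_reply]))
```

Against the real client this would be called as latest_assistant_text(client.beta.threads.messages.list(thread_id=thread.id).data).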

Where the Abstraction Leaks

The Run object is asynchronous and you poll for status. This is fine in concept and frustrating in practice.

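import time

# Poll until the run either needs tool output from us or reaches a terminal state.
while True:
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)
    if run.status in ("completed", "failed", "expired", "cancelled"):
        break
    if run.status == "requires_action":
        # handle tool calls and submit their outputs; the run stays paused until you do
        pass
    time.sleep(0.5)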
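
The loop above has no deadline and polls at a fixed rate. A hedged variant with a timeout and exponential backoff might look like the following. wait_for_run and its parameters are my own names, not part of the SDK, and the status fetch is passed in as a callable so the logic can be exercised without a network call:

```python
import time
from typing import Callable

TERMINAL = {"completed", "failed", "expired", "cancelled"}


def wait_for_run(
    fetch_status: Callable[[], str],
    timeout_s: float = 60.0,
    first_delay_s: float = 0.5,
    max_delay_s: float = 4.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the run is terminal or needs tool output; raise on timeout."""
    deadline = time.monotonic() + timeout_s
    delay = first_delay_s
    while True:
        status = fetch_status()
        # Hand control back on requires_action so the caller can run its tools.
        if status in TERMINAL or status == "requires_action":
            return status
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"run still {status!r} after {timeout_s}s")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # back off: 0.5, 1, 2, 4, 4, ...
```

With the real client, fetch_status would be something like lambda: client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id).status.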

There’s no streaming as of this week. The model is generating the response somewhere and you’re waiting on the other side of a polling loop. For an interactive chat UX this is a regression from chat completions where you stream tokens directly. OpenAI has said streaming is coming. Until it does, the UX feels worse than what you can build yourself.

The built-in Retrieval tool is a black box. You can upload files, you cannot see the chunks, you cannot tune the chunking, you cannot inspect the retrieved context, you cannot evaluate retrieval quality independently of generation. For a v1 demo, fine. For a system I’m responsible for, this is unacceptable. I want to know what the model saw. The Assistants API docs acknowledge the abstraction but don’t give you the hooks.

Pricing is opaque on retrieval. The docs mention a per-GB per-assistant per-day storage fee. Token costs for the retrieval tool’s internal calls don’t show up clearly in the response object. You’ll find out at the end of the month.

What I’d Use It For

There is a real use case here despite the warts. Internal tools where the team owns the policy data and the audit surface is low — a quick HR bot, a side-by-side documentation assistant, an onboarding helper. The “ship in two hours” path is genuinely two hours. Before this API, it was two days.

For a regulated production workload — anything touching customer PII, anything that needs versioned retrievers, anything where I need to prove what data was used in a specific response — I’d stay on the chat completions API and own the stack.

The interesting middle ground is using Assistants for the orchestration loop (tools, function calling) while keeping retrieval on your own infrastructure. You can do this: register your retrieval as a function tool instead of using the built-in Retrieval. You give up the convenience of the built-in tool, you keep the visibility.

Function Calling vs. Built-in Tools

The Assistants API supports both. The decision tree:

Code Interpreter: use the built-in. There’s no reason to roll your own Python sandbox.
Retrieval: use your own (LlamaIndex, LangChain, raw embeddings) registered as a function tool, unless you’re prototyping. The visibility loss with built-in retrieval is too high for production.
Custom domain logic: function tools, obviously.

Function tools work the same way as in chat completions. The Assistants API just wraps the loop:

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_policy",
        "description": "Search HR policy documents",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "category": {"type": "string", "enum": ["benefits", "leave", "conduct"]},
            },
            "required": ["query"],
        },
    },
}]

When the run reaches requires_action, you execute the tool and submit the output back. The model continues from there. Parallel function calling, also new at DevDay, lets the model request multiple tools in one go. This is genuinely useful when you have independent lookups that could happen concurrently.
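
As a rough sketch of that step: each pending tool call carries an id, a function name, and a JSON string of arguments, and you answer with a list of {tool_call_id, output} pairs. The lookup_policy body below is a hypothetical stand-in for your own retriever, and build_tool_outputs is my own helper name; the fake call only mimics the shape of the real objects:

```python
import json
from types import SimpleNamespace
from typing import Optional


def lookup_policy(query: str, category: Optional[str] = None) -> str:
    # Hypothetical stand-in: call your own retriever here and return its chunks.
    return json.dumps({"query": query, "category": category, "chunks": []})


HANDLERS = {"lookup_policy": lookup_policy}


def build_tool_outputs(tool_calls, handlers):
    """Run each requested function and pair its result with the call id."""
    outputs = []
    for call in tool_calls:
        fn = handlers.get(call.function.name)
        if fn is None:
            # Report the problem to the model instead of leaving the run hanging.
            result = json.dumps({"error": f"unknown tool {call.function.name}"})
        else:
            args = json.loads(call.function.arguments or "{}")
            result = fn(**args)
        outputs.append({"tool_call_id": call.id, "output": result})
    return outputs


# Fake tool call with the same shape, for trying the helper offline.
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(
        name="lookup_policy",
        arguments='{"query": "sick days", "category": "leave"}',
    ),
)
tool_outputs = build_tool_outputs([fake_call], HANDLERS)
```

In a live run the calls come from run.required_action.submit_tool_outputs.tool_calls, and the result goes back through client.beta.threads.runs.submit_tool_outputs(thread_id=thread.id, run_id=run.id, tool_outputs=tool_outputs). Because the helper loops over every call, it also covers the parallel case, although it executes the functions one after another.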

Common Pitfalls

Thread state grows. There is no explicit ceiling on how many messages a thread holds. OpenAI trims the history to fit the model’s context window once it fills (the docs mention truncation), but you can’t see how or when. For long-running threads, periodically start fresh.

Assistant updates are eventual. Updating the assistant’s instructions or model doesn’t immediately propagate to in-flight runs. Be careful when changing prompts in production.

File uploads have limits. 512MB per file, 20 files per assistant at the time of writing. If your knowledge base is bigger than that you need your own retrieval anyway.
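
A small pre-flight check can catch this before you start uploading. This is only a sketch built on the two limits quoted above; check_upload_plan is a hypothetical helper, and treating 512MB as 512 * 1024 * 1024 bytes is my assumption:

```python
MAX_FILE_BYTES = 512 * 1024 * 1024  # assumption: "512MB" read as 512 MiB
MAX_FILES_PER_ASSISTANT = 20


def check_upload_plan(sizes_by_name):
    """Return a list of human-readable problems; an empty list means the plan fits."""
    problems = []
    if len(sizes_by_name) > MAX_FILES_PER_ASSISTANT:
        problems.append(
            f"{len(sizes_by_name)} files exceeds the limit of {MAX_FILES_PER_ASSISTANT}"
        )
    for name, size in sizes_by_name.items():
        if size > MAX_FILE_BYTES:
            problems.append(f"{name} is {size} bytes, over the per-file limit")
    return problems
```

The input is a mapping of file name to size in bytes, which you could build from pathlib.Path(...).stat().st_size for each document you plan to attach.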

No webhooks. When a run completes you find out by polling. For a backend service that runs an Assistant call as part of a larger workflow, polling is a tax. I’d love an event push.

Region and data residency are not clear. For EU customers this matters. The API does not yet give you control over where thread data is stored or processed.

Cost can surprise you. Threads accumulate tokens. Every run replays the message history. A long-lived thread is silently expensive. Monitor your bill.
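
A back-of-envelope sketch of why, under two simplifying assumptions of mine: every run resends the full history, and every turn adds the same number of tokens. Real threads get trimmed at some point and retrieval adds its own context, so treat this as the shape of the curve rather than a bill estimate:

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens across `turns` runs when run k replays k turns of history."""
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


# Ten times the turns costs roughly ninety times the input tokens.
short_thread = cumulative_input_tokens(10, 200)   # 11,000 tokens
long_thread = cumulative_input_tokens(100, 200)   # 1,010,000 tokens
print(short_thread, long_thread)
```

The growth is quadratic in the number of turns, which is why starting a fresh thread now and then matters more than it first appears.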

When I’d Reach For It

I want to be fair: there are real scenarios where this is the right tool.

You’re prototyping a chat experience and you need to be in front of a user in days, not weeks. Use it.

You’re building an internal tool with non-sensitive data and a small team. Use it. The plumbing it removes is real.

You’re shipping a Code Interpreter feature. Use it. Building a sandboxed Python execution environment yourself is a serious project, and there’s no shortcut around it.

You’re building a regulated, multi-tenant SaaS product where every request needs to be auditable and reproducible. Don’t use it yet. Stay on chat completions with your own state and retrieval, and revisit when the API matures.

If you missed it, I wrote about migrating to GPT-4 Turbo earlier this week — that’s the underlying model you’d be using here anyway.

What’s Next

I’m watching for streaming support and observability hooks. Without those, the Assistants API stays in the “good for prototypes” bucket for me. With them, it becomes a serious option for a wider set of production workloads.

Next post I want to compare the function-calling capability against LangChain’s agent abstractions, because both are converging on similar ground from different angles.

Wrapping Up

The Assistants API is the most significant developer-facing API shift OpenAI has shipped in months. It’s also a v1 of an opinionated framework, and you should treat it as such. The convenience is real. The lock-in is real. The visibility loss is real.

Use it where it fits, keep your own stack where it doesn’t, and don’t rewrite production code on a beta API.
