Streaming Responses from LLM APIs

January 20, 2023 · 4 min read · by Muhammad Amal ai

TL;DR: Streaming sends tokens as they’re generated instead of after the full response completes. The UX improvement is dramatic (a typing effect versus a 30-second wait). Implement it with Server-Sent Events (SSE). Buffering proxies and CDNs sometimes break it, so disable buffering explicitly.

After few-shot prompting, streaming is the next UX-critical pattern. A 30-second non-streamed LLM response feels broken. The same response streamed token by token feels instant.

How OpenAI’s streaming works

Set stream=True on the API call. The response becomes a server-sent event stream:

data: {"choices":[{"text":"Hello","index":0}]}

data: {"choices":[{"text":" world","index":0}]}

data: {"choices":[{"text":"!","index":0}]}

data: [DONE]

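Each event delivers a small chunk (typically 1-5 tokens), and the client processes chunks as they arrive. The final event carries the literal [DONE] sentinel, which is plain text rather than JSON.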
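To make the wire format concrete, here is a minimal sketch that parses the sample stream above. The function name iter_completion_text and the SAMPLE_STREAM constant are made up for illustration, and the sketch assumes the whole body is already in memory, which a real client would not do.

```python
import json
from typing import Iterator

# The sample stream from above, as it would arrive over the wire.
SAMPLE_STREAM = (
    'data: {"choices":[{"text":"Hello","index":0}]}\n\n'
    'data: {"choices":[{"text":" world","index":0}]}\n\n'
    'data: {"choices":[{"text":"!","index":0}]}\n\n'
    'data: [DONE]\n\n'
)


def iter_completion_text(stream_body: str) -> Iterator[str]:
    """Yield the text of each event until the [DONE] sentinel (illustrative helper)."""
    for line in stream_body.splitlines():
        if not line.startswith("data: "):
            continue  # blank separator lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return  # the sentinel is plain text, not JSON
        yield json.loads(payload)["choices"][0]["text"]


print("".join(iter_completion_text(SAMPLE_STREAM)))  # prints: Hello world!
```

Checking for [DONE] before calling json.loads matters: the sentinel would otherwise raise a decode error.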

Python — server-side

For a FastAPI endpoint forwarding the stream:

import json  # needed for json.dumps in event_stream below

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import openai

app = FastAPI()

@app.post("/chat")
async def chat(prompt: str):
    async def event_stream():
        response = await openai.Completion.acreate(
            model="text-davinci-003",
            prompt=prompt,
            max_tokens=500,
            stream=True,
        )
        async for chunk in response:
            text = chunk.choices[0].text
            yield f"data: {json.dumps({'text': text})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        event_stream(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # disable nginx buffering
            "Connection": "keep-alive",
        }
    )

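Three headers matter here:

Cache-Control: no-cache keeps intermediaries from caching the stream.
X-Accel-Buffering: no tells nginx not to buffer the response.
Connection: keep-alive asks for the TCP connection to stay open (already the default in HTTP/1.1, and ignored under HTTP/2).

Of the three, X-Accel-Buffering does the heavy lifting: without it, an nginx in front of your app buffers the entire response by default, defeating streaming.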

Node — server-side

import express from 'express';
import { Configuration, OpenAIApi } from 'openai';

const openai = new OpenAIApi(new Configuration({ apiKey: process.env.OPENAI_API_KEY }));
const app = express();
app.use(express.json());

app.post('/chat', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('X-Accel-Buffering', 'no');
  res.setHeader('Connection', 'keep-alive');

  const response = await openai.createCompletion({
    model: 'text-davinci-003',
    prompt: req.body.prompt,
    max_tokens: 500,
    stream: true,
  }, { responseType: 'stream' });

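  // A network chunk can end mid-line, so carry the unfinished tail over to the next one.
  let pending = '';
  (response.data as any).on('data', (chunk: Buffer) => {
    pending += chunk.toString();
    const lines = pending.split('\n');
    pending = lines.pop() ?? ''; // last element is an incomplete line (or '')
    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const data = line.slice(6);
      if (data === '[DONE]') {
        res.write('data: [DONE]\n\n');
        res.end();
        return;
      }
      try {
        const json = JSON.parse(data);
        const text = json.choices[0].text;
        res.write(`data: ${JSON.stringify({ text })}\n\n`);
      } catch (err) {
        // Log instead of silently swallowing malformed payloads
        console.error('Skipping malformed SSE payload', err);
      }
    }
  });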

  (response.data as any).on('end', () => res.end());
  (response.data as any).on('error', (err: Error) => {
    res.write(`data: ${JSON.stringify({ error: err.message })}\n\n`);
    res.end();
  });
});

When the OpenAI stream ends or errors, end your own response too; otherwise the client connection hangs open.

Client-side consumption

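Browser (EventSource always issues a GET, so this assumes a GET /chat route that reads the prompt from the query string):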

const eventSource = new EventSource('/chat?prompt=hello');
let buffer = '';

eventSource.onmessage = (event) => {
  if (event.data === '[DONE]') {
    eventSource.close();
    return;
  }
  const { text } = JSON.parse(event.data);
  buffer += text;
  document.getElementById('output').textContent = buffer;
};

eventSource.onerror = (err) => {
  console.error('SSE error', err);
  eventSource.close();
};

EventSource can only issue GET requests. To send a POST body, use fetch with a streaming reader:

const response = await fetch('/chat', {
  method: 'POST',
  body: JSON.stringify({ prompt: 'hello' }),
  headers: { 'Content-Type': 'application/json' },
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
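  // stream: true keeps multi-byte characters intact when they are split across chunks
  const chunk = decoder.decode(value, { stream: true });
  // parse SSE format: events end with a blank line, payloads follow "data: "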
}
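The "parse SSE format" step has one subtlety: a network chunk can end in the middle of an event, so the parser has to hold on to the unfinished tail. Here is a rough sketch of that buffering logic, written in Python to match the server example; SSEDecoder is a hypothetical helper, not part of any library, and it only handles single-line data: fields. The same idea ports directly to the reader loop above.

```python
from typing import List


class SSEDecoder:
    """Hypothetical helper: buffers raw text and returns complete `data:` payloads."""

    def __init__(self) -> None:
        self._pending = ""  # tail of an event that has not finished arriving

    def feed(self, chunk: str) -> List[str]:
        self._pending += chunk
        # Events end with a blank line; the last piece may still be incomplete.
        *events, self._pending = self._pending.split("\n\n")
        payloads = []
        for event in events:
            for line in event.split("\n"):
                if line.startswith("data: "):
                    payloads.append(line[len("data: "):])
        return payloads


decoder = SSEDecoder()
# Chunks deliberately cut mid-event, the way a network read might deliver them.
for chunk in ['data: {"text":"Hel', 'lo"}\n\ndata: {"te', 'xt":" world"}\n\ndata: [DONE]\n\n']:
    print(decoder.feed(chunk))
```

The first call returns an empty list because no event is complete yet; the payload only appears once the blank line that terminates it has arrived.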

Proxy + CDN gotchas

Nginx: buffers proxied responses by default. Sending the X-Accel-Buffering: no header from your app disables buffering for that response. Alternatively, turn it off in the nginx config:

proxy_buffering off;
proxy_request_buffering off;
proxy_read_timeout 600s;

Cloudflare: SSE generally works, but some plans add buffering, and long streams (10+ minutes) can disconnect. Fall back to WebSockets if you need very long sessions.

AWS API Gateway: did not support response streaming at the time of writing. Use a Lambda function URL or an Application Load Balancer instead.

Vercel / Netlify: serverless functions have time limits (10-60 seconds depending on plan). For long streams, use Edge Functions, which handle streams better.

If streaming works locally but breaks in production: it’s almost always a proxy or platform-level buffering issue.

Backpressure

The client may disconnect before the stream finishes (for example, the user closes the tab). Detect that and stop reading from OpenAI:

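from fastapi import Request

@app.post("/chat")
async def chat(request: Request, prompt: str):
    async def event_stream():
        response = await openai.Completion.acreate(...)
        async for chunk in response:
            # Stop pulling from OpenAI once the client has gone away
            if await request.is_disconnected():
                return
            yield f"data: ..."

    return StreamingResponse(event_stream(), media_type="text/event-stream")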

Without this check, you keep paying OpenAI for tokens nobody receives.
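To see the effect without a real API key, here is a small simulation. Everything in it is a stand-in: fake_llm_stream plays the role of the OpenAI stream, relay is a hypothetical wrapper, and the is_disconnected callback mimics request.is_disconnected() by reporting a disconnect after three events.

```python
import asyncio
from typing import AsyncGenerator, Awaitable, Callable, List, Tuple


async def fake_llm_stream(total: int, produced: List[str]) -> AsyncGenerator[str, None]:
    """Stand-in upstream: records every token it generates (what you would pay for)."""
    for i in range(total):
        await asyncio.sleep(0)  # placeholder for network latency
        token = f"tok{i}"
        produced.append(token)
        yield token


async def relay(
    upstream: AsyncGenerator[str, None],
    is_disconnected: Callable[[], Awaitable[bool]],
) -> AsyncGenerator[str, None]:
    """Forward tokens as SSE events until the client goes away."""
    try:
        async for token in upstream:
            if await is_disconnected():
                return
            yield f"data: {token}\n\n"
    finally:
        await upstream.aclose()  # make sure the upstream generator stops too


async def demo() -> Tuple[int, int]:
    produced: List[str] = []
    sent: List[str] = []

    async def is_disconnected() -> bool:
        return len(sent) >= 3  # pretend the tab closes after three events

    async for event in relay(fake_llm_stream(100, produced), is_disconnected):
        sent.append(event)
    return len(produced), len(sent)


print(asyncio.run(demo()))  # (tokens generated upstream, events delivered)
```

The upstream generates one token more than was delivered, because the check runs after each token arrives, and then stops instead of running through all 100.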

When NOT to stream

Background jobs (nobody waiting in real time)
Structured output where you need to validate the whole JSON before showing anything
Very short responses (<200 tokens; non-streamed is fast enough)
Cases where partial output is misleading (financial calculations, code that needs to compile)
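The structured-output case is easy to demonstrate. In this sketch (collect_json and the sample chunks are made up for illustration), a partially received JSON document fails to parse, so the only safe point to validate is after the last chunk:

```python
import json
from typing import Any, Iterable


def collect_json(chunks: Iterable[str]) -> Any:
    """Join streamed chunks and parse once the stream is complete."""
    return json.loads("".join(chunks))


# Hypothetical chunks of a structured response, split at arbitrary points.
chunks = ['{"total": ', '41.5, "curr', 'ency": "USD"}']

try:
    json.loads("".join(chunks[:2]))
except json.JSONDecodeError:
    print("partial output is not valid JSON yet")

print(collect_json(chunks))
```

If you still want a progress indicator for such responses, show a spinner while collecting and render only the validated result.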

Common Pitfalls

No X-Accel-Buffering: no. Nginx buffers; streaming defeated.

No client disconnect detection. Pay for tokens nobody sees.

Trying to parse JSON output mid-stream. Output isn’t valid JSON until complete. Stream raw tokens to UI; parse after.

Single global SSE connection for many users. Each request gets its own; don’t share.

Forgetting to call res.end(). Connection stays open forever; resource leak.

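SSE without keep-alive. An idle connection gets dropped once a proxy’s default timeout fires. Send periodic SSE comment lines (lines starting with a colon) as heartbeats.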
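For the keep-alive pitfall, one option is to wrap the event generator so it emits an SSE comment line whenever the upstream has been quiet for a while. Lines starting with a colon are comments in the SSE format, and EventSource ignores them. The sketch below is one possible approach, not the only one; with_heartbeat, slow_stream, and the interval values are illustrative assumptions.

```python
import asyncio
from typing import AsyncIterator, List

HEARTBEAT = ": keep-alive\n\n"  # SSE comment line; clients ignore it


async def with_heartbeat(events: AsyncIterator[str], interval: float) -> AsyncIterator[str]:
    """Yield events from `events`, plus a heartbeat after each `interval` seconds of silence."""
    iterator = events.__aiter__()
    pending = asyncio.ensure_future(iterator.__anext__())
    try:
        while True:
            # Wait for the next event without cancelling it on timeout.
            done, _ = await asyncio.wait({pending}, timeout=interval)
            if not done:
                yield HEARTBEAT
                continue
            try:
                event = pending.result()
            except StopAsyncIteration:
                return  # upstream finished
            yield event
            pending = asyncio.ensure_future(iterator.__anext__())
    finally:
        pending.cancel()  # no-op if the task already completed


async def slow_stream() -> AsyncIterator[str]:
    """Stand-in for a model that pauses between chunks."""
    for text in ("Hello", " world"):
        await asyncio.sleep(0.2)
        yield f"data: {text}\n\n"


async def demo() -> List[str]:
    return [event async for event in with_heartbeat(slow_stream(), interval=0.05)]


print(asyncio.run(demo()))
```

The reason for asyncio.wait rather than asyncio.wait_for is that wait_for cancels the pending read on timeout, which would tear down the upstream generator mid-stream. In production the interval would be tens of seconds, comfortably below the shortest proxy idle timeout in your path.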

Wrapping Up

Streaming = better UX, same cost. SSE + buffer-disabling headers + disconnect detection = production-ready. Next up on Tuesday: cost control and token budgets.
