Streaming Responses from LLM APIs
January 20, 2023 · 4 min read · by Muhammad Amal ai

TL;DR — Streaming sends tokens as they’re generated, not after the full response. UX improvement is dramatic (typing effect vs 30-second wait). Implement via Server-Sent Events. Buffering proxies and CDNs sometimes break it. Disable buffering explicitly.

After few-shot prompting, streaming is the next UX-critical pattern. A 30-second non-streamed LLM response feels broken. The same response streamed token by token feels instant.

How OpenAI’s streaming works

Set stream=True on the API call. The response becomes a server-sent event stream:

data: {"choices":[{"text":"Hello","index":0}]}

data: {"choices":[{"text":" world","index":0}]}

data: {"choices":[{"text":"!","index":0}]}

data: [DONE]

Each event delivers a small chunk, typically 1-5 tokens. The client processes chunks as they arrive.
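To make the wire format concrete, here is a minimal sketch of a parser for the events shown above. The names SSEParser and collect_text are made up for this example, and it assumes every payload line starts with "data: " (the SSE format also allows other fields, which this sketch ignores). The important detail is the buffer: a network chunk can end in the middle of an event, so the unfinished tail is kept until the next chunk arrives.

```python
import json
from typing import Iterator, List


class SSEParser:
    """Incrementally extract `data:` payloads from a text/event-stream body."""

    def __init__(self) -> None:
        self._pending = ""  # unfinished tail carried over between chunks

    def feed(self, chunk: str) -> Iterator[str]:
        """Accept one network chunk and yield every complete `data:` payload."""
        self._pending += chunk
        # An event is terminated by a blank line, i.e. two consecutive newlines.
        while "\n\n" in self._pending:
            event, self._pending = self._pending.split("\n\n", 1)
            for line in event.split("\n"):
                if line.startswith("data: "):
                    yield line[len("data: "):]


def collect_text(chunks: List[str]) -> str:
    """Join the `text` field of each event until the [DONE] sentinel."""
    parser = SSEParser()
    pieces = []
    for chunk in chunks:
        for payload in parser.feed(chunk):
            if payload == "[DONE]":
                return "".join(pieces)
            pieces.append(json.loads(payload)["choices"][0]["text"])
    return "".join(pieces)
```

Feeding it the three events above, even when they are split at awkward chunk boundaries, yields the full text once [DONE] is seen.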

Python — server-side

For a FastAPI endpoint forwarding the stream:

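import json

from fastapi import Body, FastAPI
from fastapi.responses import StreamingResponse
import openai  # openai<1.0 SDK; Completion.acreate was removed in 1.0

app = FastAPI()

# embed=True reads the prompt from a JSON body of the form {"prompt": "..."}
@app.post("/chat")
async def chat(prompt: str = Body(..., embed=True)):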
    async def event_stream():
        response = await openai.Completion.acreate(
            model="text-davinci-003",
            prompt=prompt,
            max_tokens=500,
            stream=True,
        )
        async for chunk in response:
            text = chunk.choices[0].text
            yield f"data: {json.dumps({'text': text})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        event_stream(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # disable nginx buffering
            "Connection": "keep-alive",
        }
    )

Three critical headers:

  • Cache-Control: no-cache — no intermediate caching
  • X-Accel-Buffering: no — tells nginx not to buffer the response
  • Connection: keep-alive — keep the TCP connection open

Without X-Accel-Buffering: no (or the equivalent nginx config), an nginx in front of your app buffers the entire response, defeating streaming.

Node — server-side

import express from 'express';
import { Configuration, OpenAIApi } from 'openai';

const openai = new OpenAIApi(new Configuration({ apiKey: process.env.OPENAI_API_KEY }));
const app = express();
app.use(express.json());

app.post('/chat', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('X-Accel-Buffering', 'no');
  res.setHeader('Connection', 'keep-alive');

  const response = await openai.createCompletion({
    model: 'text-davinci-003',
    prompt: req.body.prompt,
    max_tokens: 500,
    stream: true,
  }, { responseType: 'stream' });

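  // A network chunk can end mid-line, so keep the unfinished tail between chunks.
  let pending = '';
  (response.data as any).on('data', (chunk: Buffer) => {
    pending += chunk.toString();
    const lines = pending.split('\n');
    pending = lines.pop() ?? '';
    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const data = line.slice(6).trim();
      if (data === '[DONE]') {
        res.write('data: [DONE]\n\n');
        res.end();
        return;
      }
      try {
        const json = JSON.parse(data);
        res.write(`data: ${JSON.stringify({ text: json.choices[0].text })}\n\n`);
      } catch (err) {
        // Log instead of silently swallowing malformed payloads
        console.error('Skipping malformed SSE payload', err);
      }
    }
  });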

  (response.data as any).on('end', () => res.end());
  (response.data as any).on('error', (err: Error) => {
    res.write(`data: ${JSON.stringify({ error: err.message })}\n\n`);
    res.end();
  });
});

When the OpenAI stream ends or errors, end your own response too.

Client-side consumption

Browser, assuming the endpoint is also exposed over GET (EventSource cannot send a POST):

const eventSource = new EventSource('/chat?prompt=hello');
let buffer = '';

eventSource.onmessage = (event) => {
  if (event.data === '[DONE]') {
    eventSource.close();
    return;
  }
  const { text } = JSON.parse(event.data);
  buffer += text;
  document.getElementById('output').textContent = buffer;
};

eventSource.onerror = (err) => {
  console.error('SSE error', err);
  eventSource.close();
};

EventSource can only issue GET requests. To send a POST body, use fetch with a streaming reader:

const response = await fetch('/chat', {
  method: 'POST',
  body: JSON.stringify({ prompt: 'hello' }),
  headers: { 'Content-Type': 'application/json' },
});

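const reader = response.body.getReader();
const decoder = new TextDecoder();
let pending = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters intact across chunk boundaries
  pending += decoder.decode(value, { stream: true });
  // SSE events end with a blank line; keep the unfinished tail for the next read
  const events = pending.split('\n\n');
  pending = events.pop();
  for (const event of events) {
    if (!event.startsWith('data: ')) continue;
    const data = event.slice(6);
    if (data === '[DONE]') continue;
    document.getElementById('output').textContent += JSON.parse(data).text;
  }
}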

Proxy + CDN gotchas

Nginx: buffers proxied responses by default. Sending the X-Accel-Buffering: no header from your app disables buffering for that response. Alternatively, in the nginx config:

proxy_buffering off;
proxy_request_buffering off;
proxy_read_timeout 600s;

Cloudflare: SSE generally works, but some plans add buffering, and long streams (10+ min) can disconnect. Use a WebSocket fallback if you need very long sessions.

AWS API Gateway: doesn’t support streaming. Use a Lambda function URL or Application Load Balancer instead.

Vercel / Netlify: serverless functions have time limits (10-60 sec depending on plan). For long streams, use Edge Functions which handle streams better.

If streaming works locally but breaks in production: it’s almost always a proxy or platform-level buffering issue.
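A quick way to confirm buffering is to look at when the chunks arrive. The sketch below is a rough heuristic, not a definitive test: looks_buffered and its burst_fraction threshold are made up for this example. It assumes you have recorded, for each chunk, the seconds elapsed since the request was sent (for example with time.monotonic() around each read).

```python
from typing import Sequence


def looks_buffered(arrival_times: Sequence[float], burst_fraction: float = 0.1) -> bool:
    """Heuristic check for a buffering proxy.

    arrival_times: seconds since the request was sent, one entry per chunk.
    Returns True when the chunks arrived in one burst at the end, i.e. the gap
    between first and last chunk is tiny compared to the total elapsed time.
    """
    if len(arrival_times) < 2:
        return False  # not enough data to tell
    spread = arrival_times[-1] - arrival_times[0]
    return spread < burst_fraction * arrival_times[-1]
```

A streamed response spreads its chunks across the whole generation time. A buffered one shows a long silence followed by everything at once.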

Client disconnects

The client may disconnect before the stream finishes (for example, the user closes the tab). Detect this and stop the OpenAI call:

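from fastapi import Request

@app.post("/chat")
async def chat(request: Request, prompt: str):
    async def event_stream():
        # Same arguments as the endpoint above, including stream=True
        response = await openai.Completion.acreate(...)
        async for chunk in response:
            # Stop pulling tokens once the browser has gone away
            if await request.is_disconnected():
                return
            yield f"data: ..."

    return StreamingResponse(event_stream(), media_type="text/event-stream")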

Without this check, you keep paying OpenAI for tokens nobody receives.
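The same idea can be tried without FastAPI or OpenAI. In this sketch, fake_token_stream stands in for the upstream LLM stream and stop_on_disconnect is a hypothetical wrapper. Both names are assumptions for illustration. The consumed list records how many tokens were actually pulled from upstream, which is what you pay for.

```python
import asyncio
from typing import AsyncGenerator, Awaitable, Callable, List


async def fake_token_stream(
    tokens: List[str], consumed: List[str]
) -> AsyncGenerator[str, None]:
    """Stand-in for the upstream LLM stream; records every token it produces."""
    for token in tokens:
        await asyncio.sleep(0)  # yield control, as a network read would
        consumed.append(token)
        yield token


async def stop_on_disconnect(
    upstream: AsyncGenerator[str, None],
    is_disconnected: Callable[[], Awaitable[bool]],
) -> AsyncGenerator[str, None]:
    """Forward SSE frames until the client goes away, then close the upstream."""
    try:
        async for token in upstream:
            if await is_disconnected():
                return
            yield f"data: {token}\n\n"
    finally:
        # Closing the generator is what prevents further upstream reads.
        await upstream.aclose()
```

With a client that disappears after two frames, only one extra token is pulled from upstream before the loop stops. The rest of the stream is never requested.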

When NOT to stream

  • Background jobs (nobody waiting in real time)
  • Structured output where you need to validate the whole JSON before showing anything
  • Very short responses (<200 tokens; non-streamed is fast enough)
  • Cases where partial output is misleading (financial calculations, code that needs to compile)
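For the structured-output case, a minimal sketch of "accumulate first, parse once" looks like this. parse_when_complete is a made-up helper name, and it assumes the chunks are the plain text pieces of a single JSON document.

```python
import json
from typing import Any, Iterable, Optional


def parse_when_complete(chunks: Iterable[str]) -> Optional[Any]:
    """Accumulate streamed text and parse it as JSON only after the last chunk."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk  # keep accumulating; nothing is parsed or shown yet
    try:
        return json.loads(buffer)
    except json.JSONDecodeError:
        return None  # incomplete or invalid JSON; the caller decides what to do
```

You can still stream from the API here to keep the connection active. The point is only that nothing reaches the user until the whole document validates.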

Common Pitfalls

No X-Accel-Buffering: no. Nginx buffers; streaming defeated.

No client disconnect detection. Pay for tokens nobody sees.

Trying to parse JSON output mid-stream. Output isn’t valid JSON until complete. Stream raw tokens to UI; parse after.

Single global SSE connection for many users. Each request gets its own; don’t share.

Forgetting to call res.end(). Connection stays open forever; resource leak.

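SSE without keep-alive. An idle connection gets dropped once a proxy's default read timeout expires. Send a periodic comment line (one starting with ":") as a heartbeat during long pauses.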
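One way to keep an idle SSE connection alive is a heartbeat wrapper around the event generator. This is a sketch under assumptions: with_heartbeat and _next_or_done are hypothetical names, and the 15-second default is arbitrary. It relies on the SSE rule that lines starting with ":" are comments, which clients ignore.

```python
import asyncio
from typing import AsyncGenerator, AsyncIterator

_DONE = object()  # sentinel marking the end of the upstream stream


async def _next_or_done(iterator: AsyncIterator[str]):
    """Return the next upstream frame, or _DONE when the stream is exhausted."""
    try:
        return await iterator.__anext__()
    except StopAsyncIteration:
        return _DONE


async def with_heartbeat(
    upstream: AsyncIterator[str], interval: float = 15.0
) -> AsyncGenerator[str, None]:
    """Yield upstream frames, plus a comment frame after each silent `interval`."""
    iterator = upstream.__aiter__()
    pending = asyncio.ensure_future(_next_or_done(iterator))
    try:
        while True:
            # asyncio.wait does not cancel the task on timeout, so the
            # in-flight upstream read survives across heartbeats.
            done, _ = await asyncio.wait({pending}, timeout=interval)
            if not done:
                yield ": keep-alive\n\n"  # SSE comment; ignored by clients
                continue
            frame = pending.result()
            if frame is _DONE:
                return
            yield frame
            pending = asyncio.ensure_future(_next_or_done(iterator))
    finally:
        pending.cancel()  # no-op if the read already finished
```

In the FastAPI endpoint this would wrap the generator, for example StreamingResponse(with_heartbeat(event_stream()), media_type="text/event-stream").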

Wrapping Up

Streaming = better UX, same cost. SSE + buffer-disabling headers + disconnect detection = production-ready. Tuesday: cost control + token budgets.