Streaming Responses from LLM APIs
TL;DR — Streaming sends tokens as they’re generated, not after the full response. UX improvement is dramatic (typing effect vs 30-second wait). Implement via Server-Sent Events. Buffering proxies and CDNs sometimes break it. Disable buffering explicitly.
After few-shot prompting, streaming is the UX-critical pattern. A 30-second non-streamed LLM response feels broken. The same response streamed token by token feels instant.
How OpenAI’s streaming works
Set stream=True on the API call. The response becomes a server-sent event stream:
data: {"choices":[{"text":"Hello","index":0}]}
data: {"choices":[{"text":" world","index":0}]}
data: {"choices":[{"text":"!","index":0}]}
data: [DONE]
Each event delivers a small chunk (typically 1-5 tokens). The client processes events as they arrive, and the literal [DONE] payload marks the end of the stream.
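To make the wire format concrete, here is a minimal Python sketch that turns those lines back into text. The function name `extract_text` is hypothetical, and the sketch assumes each line arrives whole and uses the `choices[0].text` shape shown above.

```python
import json
from typing import Iterable, Iterator


def extract_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragment carried by each `data:` line until [DONE]."""
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip the blank separator lines between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream sentinel; it is not JSON
        event = json.loads(payload)
        yield event["choices"][0]["text"]


sample = [
    'data: {"choices":[{"text":"Hello","index":0}]}',
    "",
    'data: {"choices":[{"text":" world","index":0}]}',
    "",
    'data: {"choices":[{"text":"!","index":0}]}',
    "",
    "data: [DONE]",
]
print("".join(extract_text(sample)))  # Hello world!
```

Checking for the sentinel before calling `json.loads` matters, because `[DONE]` would otherwise raise a decode error.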
Python — server-side
For a FastAPI endpoint forwarding the stream:
import json

import openai
from fastapi import Body, FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/chat")
async def chat(prompt: str = Body(..., embed=True)):  # expects a JSON body: {"prompt": "..."}
    async def event_stream():
        # Legacy Completions call from the pre-1.0 openai SDK
        response = await openai.Completion.acreate(
            model="text-davinci-003",
            prompt=prompt,
            max_tokens=500,
            stream=True,
        )
        async for chunk in response:
            text = chunk.choices[0].text
            # JSON-encode so newlines in the text cannot break SSE framing
            yield f"data: {json.dumps({'text': text})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        event_stream(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # disable nginx buffering
            "Connection": "keep-alive",
        },
    )
Three critical headers:
- Cache-Control: no-cache tells intermediaries not to cache the response
- X-Accel-Buffering: no tells nginx not to buffer the response
- Connection: keep-alive asks for the TCP connection to stay open (already the default in HTTP/1.1)
Without X-Accel-Buffering: no, an nginx in front of your app buffers the response before passing it on, defeating streaming.
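Headers aside, the endpoint above also relies on JSON-encoding each chunk before writing it. A short sketch of why: SSE ends an event at a blank line, so a raw newline inside model output would cut the event in two. The helper name `sse_event` is hypothetical.

```python
import json


def sse_event(payload: dict) -> str:
    """Serialize one payload as a single SSE event: a data line plus a blank line."""
    # json.dumps escapes newlines inside strings, so the payload stays on one line.
    return f"data: {json.dumps(payload)}\n\n"


event = sse_event({"text": "line one\nline two"})
print(repr(event))
```

The only real newlines in the result are the two that terminate the event; the one inside the text is carried as an escaped `\n` and restored by `JSON.parse` on the client.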
Node — server-side
import express from 'express';
import { Configuration, OpenAIApi } from 'openai'; // openai v3 SDK

const openai = new OpenAIApi(new Configuration({ apiKey: process.env.OPENAI_API_KEY }));
const app = express();
app.use(express.json());

app.post('/chat', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('X-Accel-Buffering', 'no');
  res.setHeader('Connection', 'keep-alive');

  const response = await openai.createCompletion({
    model: 'text-davinci-003',
    prompt: req.body.prompt,
    max_tokens: 500,
    stream: true,
  }, { responseType: 'stream' });

  const upstream = response.data as any; // readable stream because of responseType: 'stream'
  let pending = ''; // partial line carried over between chunks

  upstream.on('data', (chunk: Buffer) => {
    pending += chunk.toString();
    const lines = pending.split('\n');
    pending = lines.pop() ?? ''; // the last element may be an incomplete line
    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const data = line.slice(6);
      if (data === '[DONE]') {
        res.write('data: [DONE]\n\n');
        res.end();
        return;
      }
      try {
        const json = JSON.parse(data);
        const text = json.choices[0].text;
        res.write(`data: ${JSON.stringify({ text })}\n\n`);
      } catch {
        // skip lines that are not valid JSON
      }
    }
  });

  upstream.on('end', () => {
    if (!res.writableEnded) res.end(); // avoid ending twice after [DONE]
  });
  upstream.on('error', (err: Error) => {
    res.write(`data: ${JSON.stringify({ error: err.message })}\n\n`);
    res.end();
  });
});
Watch for the stream closing or erroring on the OpenAI side, and end your own response when it does.
Client-side consumption
Browser:
const eventSource = new EventSource('/chat?prompt=hello');
let buffer = '';
eventSource.onmessage = (event) => {
  if (event.data === '[DONE]') {
    eventSource.close();
    return;
  }
  const { text } = JSON.parse(event.data);
  buffer += text;
  document.getElementById('output').textContent = buffer;
};
eventSource.onerror = (err) => {
  console.error('SSE error', err);
  eventSource.close();
};
EventSource only issues GET requests, so the example above needs a GET route on the server. To send a POST body instead, use fetch with a streaming reader:
const response = await fetch('/chat', {
  method: 'POST',
  body: JSON.stringify({ prompt: 'hello' }),
  headers: { 'Content-Type': 'application/json' },
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters intact across chunk boundaries
  const chunk = decoder.decode(value, { stream: true });
  // parse SSE format
}
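The "parse SSE format" step has to cope with reads that end mid-line or mid-character. The logic is the same in any language; below is a minimal sketch in Python. The names `SSEChunkParser` and `read_all` are hypothetical, and each `data:` line is assumed to carry one complete payload (multi-line SSE events are not handled).

```python
import codecs
import json
from typing import List


class SSEChunkParser:
    """Reassemble `data:` payloads from arbitrarily split byte chunks."""

    def __init__(self) -> None:
        # The incremental decoder holds back incomplete multi-byte sequences.
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self._pending = ""  # text after the last complete line

    def feed(self, chunk: bytes) -> List[str]:
        self._pending += self._decoder.decode(chunk)
        # Everything before the final "\n" is complete; keep the rest for later.
        *lines, self._pending = self._pending.split("\n")
        payloads = []
        for line in lines:
            line = line.rstrip("\r")
            if line.startswith("data:"):
                value = line[len("data:"):]
                payloads.append(value[1:] if value.startswith(" ") else value)
        return payloads


def read_all(raw: bytes, size: int) -> str:
    """Simulate network reads of `size` bytes and rebuild the full text."""
    parser = SSEChunkParser()
    text = ""
    for start in range(0, len(raw), size):
        for payload in parser.feed(raw[start:start + size]):
            if payload != "[DONE]":
                text += json.loads(payload)["text"]
    return text


raw = 'data: {"text": "héllo"}\n\ndata: {"text": " world"}\n\ndata: [DONE]\n\n'.encode("utf-8")
print(read_all(raw, 6))  # héllo world
```

The same two pieces of state, a pending-line buffer and a streaming decoder, are what the browser loop needs around `decoder.decode`.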
Proxy + CDN gotchas
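Nginx: buffers proxied responses by default. Sending the X-Accel-Buffering: no header from your app disables buffering for that response. Alternatively, turn it off in the nginx config (proxy_buffering is the directive that affects responses; proxy_request_buffering covers request bodies):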
proxy_buffering off;
proxy_request_buffering off;
proxy_read_timeout 600s;
Cloudflare: SSE generally works, but some plans add buffering, and long streams (10+ minutes) can disconnect. Fall back to WebSockets if you need very long sessions.
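AWS API Gateway: has historically buffered responses instead of streaming them, so verify current support for your API type before relying on it. The usual alternatives are a Lambda function URL or an Application Load Balancer.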
Vercel / Netlify: serverless functions have time limits (10-60 seconds depending on plan). For long streams, use Edge Functions, which handle streams better.
If streaming works locally but breaks in production: it’s almost always a proxy or platform-level buffering issue.
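One way to confirm buffering is to time when each chunk reaches the client. The sketch below is a rough heuristic, not a standard tool: `record_arrivals`, `buffering_suspected`, and the 10% threshold are all hypothetical choices. It flags a response whose chunks all land in a short burst at the very end.

```python
import time
from typing import Iterable, List, Sequence


def record_arrivals(chunks: Iterable[bytes]) -> List[float]:
    """Return seconds elapsed since iteration began, one entry per chunk."""
    start = time.monotonic()
    return [time.monotonic() - start for _ in chunks]


def buffering_suspected(arrivals: Sequence[float], burst_fraction: float = 0.1) -> bool:
    """True when the gap between first and last chunk is tiny relative to total time."""
    if len(arrivals) < 2:
        return False  # not enough data to judge
    total = arrivals[-1]
    spread = arrivals[-1] - arrivals[0]
    return total > 0 and spread / total < burst_fraction


# Chunks spread over three seconds: looks like real streaming.
print(buffering_suspected([0.4, 0.9, 1.5, 2.2, 3.0]))
# Nothing for three seconds, then everything at once: something is buffering.
print(buffering_suspected([3.00, 3.01, 3.01, 3.02]))
```

Pass `record_arrivals` any iterator that yields chunks as they arrive from your HTTP client, then compare the result locally and through the production proxy.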
Client disconnects
The client may disconnect before the stream finishes (for example, the user closes the tab). Detect this and stop the OpenAI call:
from fastapi import Body, Request

@app.post("/chat")
async def chat(request: Request, prompt: str = Body(..., embed=True)):
    async def event_stream():
        response = await openai.Completion.acreate(...)
        async for chunk in response:
            # stop pulling chunks once the client has gone away
            if await request.is_disconnected():
                return
            yield f"data: ..."
Without this check, you keep paying OpenAI for tokens nobody receives.
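The effect of that check can be seen without any network calls. In the sketch below, `FakeRequest` and `fake_upstream` are hypothetical stand-ins for the request object and the OpenAI stream; the assumption is that closing the upstream iterator is what stops further generation with a real SDK.

```python
import asyncio
from typing import AsyncIterator, List


class FakeRequest:
    """Stand-in request that reports a disconnect after a set number of polls."""

    def __init__(self, disconnect_after: int) -> None:
        self._polls_left = disconnect_after

    async def is_disconnected(self) -> bool:
        self._polls_left -= 1
        return self._polls_left < 0


async def fake_upstream(consumed: List[str]) -> AsyncIterator[str]:
    """Stand-in for the model stream; records every token it produces."""
    for token in ["Hello", " world", "!", " More", " text"]:
        consumed.append(token)
        await asyncio.sleep(0)
        yield token


async def event_stream(request: FakeRequest, upstream) -> AsyncIterator[str]:
    try:
        async for token in upstream:
            if await request.is_disconnected():
                return  # client is gone: stop pulling tokens
            yield f"data: {token}\n\n"
    finally:
        await upstream.aclose()  # release the upstream stream either way


async def main():
    consumed: List[str] = []
    sent = [e async for e in event_stream(FakeRequest(2), fake_upstream(consumed))]
    return sent, consumed


sent, consumed = asyncio.run(main())
print(len(sent), consumed)  # 2 ['Hello', ' world', '!']
```

Two events reach the client, a third token is pulled before the disconnect is noticed, and the remaining two are never generated.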
When NOT to stream
- Background jobs (nobody waiting in real time)
- Structured output where you need to validate the whole JSON before showing anything
- Very short responses (<200 tokens; non-streamed is fast enough)
- Cases where partial output is misleading (financial calculations, code that needs to compile)
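For the structured-output case, the server can still request a stream from the model but hold the text back until it parses. A minimal sketch, assuming the chunks are plain text fragments and that the expected result is a JSON object; `collect_json` is a hypothetical helper name.

```python
import json
from typing import Iterable, Optional


def collect_json(chunks: Iterable[str]) -> Optional[dict]:
    """Join streamed fragments and parse once the stream has ended."""
    full_text = "".join(chunks)
    try:
        parsed = json.loads(full_text)
    except json.JSONDecodeError:
        return None  # incomplete or malformed output: show nothing
    return parsed if isinstance(parsed, dict) else None


print(collect_json(['{"total"', ": 4", "2}"]))  # {'total': 42}
print(collect_json(['{"total"', ": 4"]))        # None
```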
Common Pitfalls
No X-Accel-Buffering: no. Nginx buffers; streaming defeated.
No client disconnect detection. Pay for tokens nobody sees.
Trying to parse JSON output mid-stream. Output isn’t valid JSON until complete. Stream raw tokens to UI; parse after.
Single global SSE connection for many users. Each request gets its own; don’t share.
Forgetting to call res.end(). Connection stays open forever; resource leak.
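SSE without keep-alive. Proxies drop idle connections after their default timeout; send a periodic comment line (a line starting with a colon, such as ": ping") during long pauses.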
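The keep-alive pitfall can be handled by wrapping the event generator. The sketch below is one possible approach, with a hypothetical `with_heartbeat` wrapper and an arbitrary default interval: it emits an SSE comment line whenever the source has been quiet for too long. SSE clients ignore lines that start with a colon, so the pings never reach `onmessage`.

```python
import asyncio
from typing import AsyncIterator


async def with_heartbeat(source: AsyncIterator[str], interval: float = 15.0) -> AsyncIterator[str]:
    """Yield events from `source`, inserting ': ping' comments during idle gaps."""
    iterator = source.__aiter__()
    pending = None
    try:
        while True:
            if pending is None:
                # Keep one outstanding request for the next event.
                pending = asyncio.ensure_future(iterator.__anext__())
            done, _ = await asyncio.wait({pending}, timeout=interval)
            if not done:
                yield ": ping\n\n"  # SSE comment; the pending request stays alive
                continue
            task, pending = pending, None
            try:
                event = task.result()
            except StopAsyncIteration:
                return  # source finished
            yield event
    finally:
        if pending is not None:
            pending.cancel()  # consumer went away while we were waiting


async def slow_source() -> AsyncIterator[str]:
    yield "data: a\n\n"
    await asyncio.sleep(0.05)  # simulated pause in model output
    yield "data: b\n\n"


async def demo():
    return [e async for e in with_heartbeat(slow_source(), interval=0.01)]


events = asyncio.run(demo())
print(events)
```

Waiting on a task with a timeout, rather than cancelling and retrying the read, keeps the source generator in a consistent state across pings.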
Wrapping Up
Streaming = better UX, same cost. SSE + buffer-disabling headers + disconnect detection = production-ready. Tuesday: cost control + token budgets.