Token Streaming Pipeline: LLM to UI at Scale
Last updated: April 16, 2026
Quick answer
Use Server-Sent Events (SSE) for unidirectional streaming (LLM → user) via Next.js Route Handlers on Vercel Edge. Back-pressure is handled automatically by the Web Streams API on the server side. Time-to-first-token (TTFT) drops to 150-500ms vs 3-15s for non-streaming. At scale (10K concurrent streams), use Cloudflare Workers with Durable Objects or a dedicated WebSocket gateway (Ably, Pusher) to offload connection state.
The problem
Without streaming, users stare at a blank screen for 3-15 seconds waiting for a complete LLM response — a UX pattern that increases abandonment by 40-60% compared to streaming. Implementing streaming incorrectly causes partial renders, broken JSON mid-stream, memory leaks from unhandled disconnects, and infrastructure that collapses when 1,000 concurrent users hold open connections. The challenge is managing stateful long-lived connections across a stateless serverless infrastructure.
Architecture
Client UI (Browser / Mobile)
The frontend that initiates the streaming request and renders tokens as they arrive. Uses EventSource API (SSE) for browser clients or WebSocket for bidirectional or mobile clients. Must handle connection drops, reconnection, and partial state gracefully.
Alternatives: Native EventSource API, WebSocket client, React Query + polling (fallback)
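Where `EventSource` is unavailable (React Native, server-side consumers), the SSE wire format can be parsed by hand. A minimal sketch of an SSE frame parser, assuming the standard `event:`/`data:` field layout with blank-line delimiters between frames:

```typescript
// Parse a buffer of SSE text into (event, data) frames.
// Frames are separated by a blank line; fields are "event:" and "data:".
interface SseFrame { event: string; data: string }

function parseSseFrames(buffer: string): SseFrame[] {
  const frames: SseFrame[] = [];
  for (const raw of buffer.split('\n\n')) {
    if (!raw.trim()) continue;
    let event = 'message';            // the SSE default event type
    const dataLines: string[] = [];
    for (const line of raw.split('\n')) {
      if (line.startsWith('event:')) event = line.slice(6).trim();
      else if (line.startsWith('data:')) dataLines.push(line.slice(5).trimStart());
    }
    frames.push({ event, data: dataLines.join('\n') });
  }
  return frames;
}
```

A real client would additionally buffer partial frames across network chunks, since a frame can be split mid-line by the transport.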
API Gateway / Edge Layer
Terminates the client connection, authenticates the request (JWT validation), rate-limits per user, and proxies the stream from the LLM provider to the client. Must not buffer the full response — it must forward bytes as they arrive.
Alternatives: Cloudflare Workers, AWS API Gateway + Lambda Streaming, nginx with proxy_buffering off
LLM Provider (Streaming API)
The LLM API that supports streaming. All major providers (Anthropic, OpenAI, Google) support server-sent events streaming. Streams delta chunks, each containing 1-5 tokens. For Anthropic, `stream=True` yields `content_block_delta` events carrying `text_delta` payloads.
Alternatives: gpt-4o, gemini-2-flash, Mistral Large, Llama 3.1 via Together AI
Stream Transformer
Converts the LLM provider's native stream format (Anthropic SSE, OpenAI chunks) into the format expected by the client. Handles token accumulation, partial JSON detection (for structured output streaming), and tool-use event routing.
Alternatives: Custom TransformStream (Web Streams API), openai.beta.chat.completions.stream()
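A sketch of such a transformer using the Web Streams API. It assumes upstream chunks shaped like Anthropic's `content_block_delta` events (simplified here) and emits plain text tokens:

```typescript
// Map provider delta events (simplified Anthropic shape) to plain text tokens.
interface DeltaEvent { type: string; delta?: { type: string; text?: string } }

const textExtractor = () =>
  new TransformStream<DeltaEvent, string>({
    transform(event, controller) {
      // Forward only text deltas; ignore message_start, pings, etc.
      if (event.type === 'content_block_delta' && event.delta?.type === 'text_delta') {
        controller.enqueue(event.delta.text ?? '');
      }
    },
  });
```

Because `TransformStream` participates in back-pressure, a slow client downstream automatically slows reads from the provider upstream.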
Back-Pressure Controller
Prevents memory overflow when the client can't consume tokens as fast as the LLM produces them. Web Streams API's native back-pressure propagation handles this automatically when using ReadableStream. For WebSocket: implement a token bucket with max buffer size before pausing reads.
Alternatives: Custom token bucket rate limiter, WebSocket flow control with ACK messages
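The pull-based half of this can be sketched with `ReadableStream`: the underlying source is only read when the consumer asks for more, so a slow client naturally pauses upstream reads. Here `nextToken` is a stand-in for the provider's reader:

```typescript
// Pull-based stream: pull() fires only when the consumer requests a chunk,
// so upstream is never read much faster than the client drains it.
function pullDrivenStream(nextToken: () => string | null): ReadableStream<string> {
  return new ReadableStream<string>({
    pull(controller) {
      const token = nextToken();      // read upstream on demand only
      if (token === null) controller.close();
      else controller.enqueue(token);
    },
  });
}
```

With the default queuing strategy the stream reads at most one chunk ahead of the consumer, which is exactly the behavior that keeps a slow client from forcing the edge function to buffer the whole response.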
Mid-Stream Error Recovery
Handles errors that occur after streaming has started: LLM provider timeouts, rate limits hit mid-stream, network drops. Strategy: send an error SSE event with the partial completion and a retry token, allowing the client to resume from the last received chunk.
Alternatives: Client-side retry with last-event-id header (SSE native), Checkpoint-based resumption
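One way to make that resumable is to attach an `id:` field to every chunk; on reconnect the browser sends the last seen id back as the `Last-Event-ID` header. A sketch of the server-side framing (the error payload's field names are illustrative):

```typescript
// Frame a data chunk with an id so EventSource can resume via Last-Event-ID.
function sseChunk(id: number, text: string): string {
  return `id: ${id}\ndata: ${JSON.stringify({ text })}\n\n`;
}

// Frame a mid-stream error carrying the last delivered chunk id as a retry token.
function sseError(lastId: number, code: string): string {
  const payload = JSON.stringify({ code, resumeFrom: lastId });
  return `event: stream-error\nid: ${lastId}\ndata: ${payload}\n\n`;
}
```

On resume, the server can skip tokens up to `Last-Event-ID` if the completion was cached, or restart generation if it was not.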
Connection State Manager
Tracks active streaming connections for rate limiting, billing, and debugging. At scale (>1K concurrent), serverless functions cannot hold connection state — this must be offloaded to a stateful layer (Redis, Durable Objects, or a dedicated WebSocket server).
Alternatives: Cloudflare Durable Objects, Ably (managed WebSocket), Socket.io with Redis adapter
Partial JSON Parser
When streaming structured outputs (JSON mode), tokens arrive mid-object — e.g., `{"name": "Joh`. A partial JSON parser renders the object progressively as it becomes parseable. Critical for streaming tool-call arguments and structured data.
Alternatives: @anthropic-ai/sdk streaming with tool_use events, json-stream-parse
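A best-effort sketch of the idea: scan the prefix, close any open strings and brackets, and attempt a parse. Production parsers handle more edge cases, such as truncated literals like `tru`:

```typescript
// Try to make a truncated JSON prefix parseable by closing open structures.
// Returns undefined when no valid completion is found.
function completePartialJson(prefix: string): unknown {
  const closers: string[] = [];
  let inString = false;
  let escaped = false;
  for (const ch of prefix) {
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === '\\') escaped = true;
      else if (ch === '"') inString = false;
    } else if (ch === '"') inString = true;
    else if (ch === '{') closers.push('}');
    else if (ch === '[') closers.push(']');
    else if (ch === '}' || ch === ']') closers.pop();
  }
  let candidate = prefix;
  if (inString) candidate += '"';                    // close an open string
  candidate = candidate.replace(/,\s*$/, '');        // drop a dangling comma
  if (/:\s*$/.test(candidate)) candidate += 'null';  // key seen, value not started
  candidate += closers.reverse().join('');           // close open objects/arrays
  try { return JSON.parse(candidate); } catch { return undefined; }
}
```

The UI can call this on every accumulated chunk and render whichever fields are already present, e.g. showing `{"name": "Joh"}` while the rest of the object streams in.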
Rendered Stream Output
The progressive rendering of LLM output in the UI. For text: append tokens to a string and re-render. For structured data: progressively populate UI components as JSON fields become available. Key metrics: TTFT (time-to-first-token) and tokens-per-second render rate.
Alternatives: Markdown renderer with streaming, Code block with syntax highlighting, Structured form auto-fill
The stack
Vercel AI SDK handles SSE reconnection, state management (messages array), streaming text rendering, and error states out of the box. `useChat` reduces custom streaming client code from ~400 lines to ~10 lines. Works with any AI backend, not just Vercel.
Alternatives: Native EventSource API, TanStack Query + custom SSE hook, React Server Components (partial)
Vercel Edge Runtime has <50ms cold start (vs 200-800ms for Lambda) and supports Response with ReadableStream for native streaming. 130+ edge locations globally reduce latency by 40-80% vs a single-region server. Cost: $0.60/M edge function invocations. Critical: do NOT use Node.js runtime for streaming — it buffers responses.
Alternatives: Cloudflare Workers, AWS Lambda Response Streaming (Node 20), Fly.io long-running processes
SSE is simpler: browser-native, auto-reconnect, firewall-friendly (port 80/443), and works through HTTP proxies. WebSocket is required when you need bidirectional communication (client sends data mid-stream), or when streaming to mobile apps where EventSource is not natively available. 90% of LLM chat UIs should use SSE.
Alternatives: WebSocket via Socket.io, HTTP/2 server push (limited browser support), Long polling (fallback for proxies)
Claude Sonnet 4 streams at 60-100 tokens/second, fast enough for smooth UX. Gemini 2.0 Flash is fastest (150-200 tok/s) for latency-critical use cases. TTFT across providers: Claude 150-400ms, GPT-4o 200-500ms, Gemini 100-250ms. For fastest TTFT, use Groq (Llama 3.1 70B at 280 tok/s with 80ms TTFT).
Alternatives: OpenAI GPT-4o with stream=True, Google Gemini 2.0 Flash (150 tok/s), Together AI (open-source models, lower TTFT)
Web Streams API propagates back-pressure automatically through the chain: if the client isn't reading, the ReadableStream controller's desiredSize drops to 0 and the upstream reader pauses. This prevents the edge function from buffering the full LLM response in memory when a slow client can't keep up.
Alternatives: Node.js stream.pipeline(), Custom buffer with pause()/resume()
Serverless functions are stateless — at 10K concurrent streams, you need shared state for rate limiting and billing. Upstash Redis tracks connection counts per user ($0.20/100K commands). Cloudflare Durable Objects provide per-connection persistent state with geographic affinity for <20ms reads. Managed solutions (Ably) cost $0.0006/message but eliminate all infra work.
Alternatives: Ably (managed WebSocket gateway), Pusher Channels, Socket.io with Redis adapter on Fly.io
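A sketch of the per-user counting logic, with an in-memory `Map` standing in for Redis `INCR`/`DECR`. In production the counter must live in shared storage, since each serverless instance has its own memory:

```typescript
// Per-user concurrent-stream limiter (in-memory stand-in for Redis INCR/DECR).
class ConnectionLimiter {
  private counts = new Map<string, number>();
  constructor(private readonly maxPerUser: number) {}

  // Returns false when the user already holds maxPerUser open streams.
  acquire(userId: string): boolean {
    const n = this.counts.get(userId) ?? 0;
    if (n >= this.maxPerUser) return false;
    this.counts.set(userId, n + 1);
    return true;
  }

  // Call from the stream's close/abort handler so slots are always returned.
  release(userId: string): void {
    const n = this.counts.get(userId) ?? 0;
    if (n > 0) this.counts.set(userId, n - 1);
  }
}
```

The Redis version should also attach a TTL to each slot, so that a crashed function cannot leak a user's quota forever.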
Cost at each scale
Prototype: 5,000 streaming sessions/mo, avg 500 tokens output → $35/mo
Growth: 100,000 streaming sessions/mo, avg 1,000 tokens output → $870/mo
Scale: 2M streaming sessions/mo, 10K peak concurrent connections → $18,000/mo
Failure modes & guardrails
Orphaned streams after client disconnect: when the user closes the browser tab, the server keeps consuming LLM tokens (and paying for them) until the response ends. Mitigation: listen for the request's `AbortSignal` in the route handler (`req.signal.addEventListener('abort', ...)`) and cancel the LLM API call as soon as the client disconnects. For interactive use cases this can save 20-40% of LLM costs.
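A sketch of the pattern, with `fakeLlmStream` standing in for the provider's token stream:

```typescript
// Stop consuming upstream tokens as soon as the request's AbortSignal fires.
async function* fakeLlmStream(signal: AbortSignal): AsyncGenerator<string> {
  for (const token of ['Hello', ' ', 'world']) {
    if (signal.aborted) return;   // client gone: stop generating (and paying)
    yield token;
  }
}

// Route-handler-side consumption loop; in a real handler each token would be
// enqueued into the response's ReadableStream instead of collected.
async function consume(signal: AbortSignal): Promise<string[]> {
  const received: string[] = [];
  for await (const token of fakeLlmStream(signal)) received.push(token);
  return received;
}
```

With a real provider SDK, also forward the signal into the API call itself (most fetch-based clients accept one), so the upstream HTTP request is torn down too rather than merely ignored.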
Buffering middleware: middleware that buffers the full stream response (logging middleware, compression middleware) negates streaming and causes OOM on long responses. Mitigation: audit every middleware in the request pipeline and make sure the stream passes through unbuffered. Disable gzip compression for SSE routes; SSE works fine as uncompressed plain text, and compression middleware often waits for the full body before flushing.
Mid-stream rate limits: the LLM provider returns a 429 after the stream has started, by which point the 200 OK status has already been sent to the client. Mitigation: send an application-level SSE error event (e.g. `event: stream-error`, `data: {code: 'rate_limit', retryAfter: 60}`) and close the stream. The client's handler for that event should display a rate-limit message and offer a retry button.
Proxy buffering breaks SSE: corporate firewalls, proxies (nginx), and some CDNs buffer SSE responses until they reach a certain size or the connection closes. Mitigation: for nginx, set `proxy_buffering off` or send an `X-Accel-Buffering: no` response header. For Cloudflare, the `Cache-Control: no-cache` header bypasses buffering. Always test streaming through your full network stack, not just localhost.
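A minimal nginx location block for proxying SSE without buffering might look like this (the upstream name and path are illustrative):

```nginx
location /api/chat {
    proxy_pass http://app_upstream;   # illustrative upstream name
    proxy_http_version 1.1;
    proxy_set_header Connection '';   # keep the upstream connection open
    proxy_buffering off;              # forward bytes as they arrive
    proxy_cache off;
    proxy_read_timeout 300s;          # allow long-lived streams
}
```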
Frequently asked questions
How do I implement streaming in a Next.js 16 App Router route handler?
Use the edge runtime with a ReadableStream response: `export const runtime = 'edge'`. In the route handler, create an Anthropic or OpenAI streaming client, pipe the stream through the Vercel AI SDK's `toAIStream()` helper, and return `new StreamingTextResponse(stream)`. The Vercel AI SDK handles SSE formatting, heartbeats, and error serialization automatically. Don't forget to set `Content-Type: text/event-stream` and `Cache-Control: no-cache` headers.
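A dependency-free sketch of the shape of such a handler, with `tokenSource` standing in for a real provider stream (the Vercel AI SDK helpers mentioned above replace most of this body):

```typescript
// Sketch of app/api/chat/route.ts. tokenSource is a stand-in for a real
// provider stream such as an Anthropic or OpenAI streaming client.
export const runtime = 'edge';

async function* tokenSource(): AsyncGenerator<string> {
  yield* ['Hello', ', ', 'world'];
}

export async function POST(_req: Request): Promise<Response> {
  const encoder = new TextEncoder();
  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      // Emit one SSE frame per token; a real handler would also watch
      // _req.signal and cancel the provider call on client disconnect.
      for await (const token of tokenSource()) {
        controller.enqueue(encoder.encode(`data: ${JSON.stringify({ token })}\n\n`));
      }
      controller.enqueue(encoder.encode('data: [DONE]\n\n'));
      controller.close();
    },
  });
  return new Response(stream, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      Connection: 'keep-alive',
    },
  });
}
```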
How do I handle errors that happen mid-stream?
Mid-stream errors are tricky because the HTTP 200 status has already been sent. The standard pattern: (1) catch the error in your stream transformer, (2) send an application-level SSE error event, e.g. `event: stream-error\ndata: {message: 'generation failed', code: 'provider_error'}\n\n`, (3) close the stream. On the client, add `eventSource.addEventListener('stream-error', ...)`; avoid naming the custom event `error`, since that name collides with EventSource's built-in connection-error event. The handler should display the error and re-enable the submit button.
What's the maximum number of concurrent streaming connections I can handle on Vercel?
Vercel Edge Functions have no explicit concurrency limit, but each function execution has a 30-second limit and uses memory proportional to the response size in flight. In practice, 1,000-5,000 concurrent streams work fine on Vercel Pro. Above 10,000, you'll hit Vercel's function concurrency quotas (default: 1,000 concurrent executions per deployment). Solution: upgrade to Vercel Enterprise, or offload connections to a dedicated WebSocket server (Ably, Fly.io) that maintains long-lived connections independently.
Should I stream tool-use / function calling results?
Yes for better UX, but it's more complex. Anthropic's streaming API sends `input_json_delta` events while the model is building tool call arguments — you can render a 'Calling search tool...' indicator immediately. After the tool executes, stream the model's synthesis of results. The Vercel AI SDK's `streamObject()` and the `experimental_toolCallStreaming` option handle this pattern. For OpenAI, parse the `delta.tool_calls` chunks manually.
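Manually accumulating those chunks can be sketched as follows. The delta shape is simplified from OpenAI's streaming response; real deltas also carry `id` and `type` fields:

```typescript
// Accumulate OpenAI-style tool_call deltas: name and argument fragments
// arrive across many chunks, keyed by the tool call's index.
interface ToolCallDelta {
  index: number;
  function?: { name?: string; arguments?: string };
}

function accumulateToolCalls(
  deltas: ToolCallDelta[],
): { name: string; arguments: string }[] {
  const calls: { name: string; arguments: string }[] = [];
  for (const d of deltas) {
    const call = (calls[d.index] ??= { name: '', arguments: '' });
    if (d.function?.name) call.name += d.function.name;
    if (d.function?.arguments) call.arguments += d.function.arguments;
  }
  return calls;
}
```

Combined with a partial JSON parser, the accumulated `arguments` string can be rendered progressively before the tool call is complete.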