Token Streaming Pipeline: LLM to UI at Scale
Last updated: April 16, 2026
Quick answer
Use Server-Sent Events (SSE) for unidirectional streaming (LLM → user) via Next.js Route Handlers on Vercel Edge. Back-pressure is handled automatically by the Web Streams API on the server side. Time-to-first-token (TTFT) drops to 150-500ms vs 3-15s for non-streaming. At scale (10K concurrent streams), use Cloudflare Workers with Durable Objects or a dedicated WebSocket gateway (Ably, Pusher) to offload connection state.
The problem
Without streaming, users stare at a blank screen for 3-15 seconds waiting for a complete LLM response — a UX pattern that increases abandonment by 40-60% compared to streaming. Implementing streaming incorrectly causes partial renders, broken JSON mid-stream, memory leaks from unhandled disconnects, and infrastructure that collapses when 1,000 concurrent users hold open connections. The challenge is managing stateful long-lived connections across a stateless serverless infrastructure.
Architecture
Client UI (Browser / Mobile)
The frontend that initiates the streaming request and renders tokens as they arrive. Uses EventSource API (SSE) for browser clients or WebSocket for bidirectional or mobile clients. Must handle connection drops, reconnection, and partial state gracefully.
Alternatives: Native EventSource API, WebSocket client, React Query + polling (fallback)
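Where `EventSource` is unavailable (React Native, server-side consumers), the SSE wire format can be parsed by hand. A minimal sketch of an SSE frame parser, assuming the standard `event:`/`data:` field layout with blank-line delimiters between frames:

```typescript
// Parse a buffer of SSE text into (event, data) frames.
// Frames are separated by a blank line; fields are "event:" and "data:".
interface SseFrame { event: string; data: string }

function parseSseFrames(buffer: string): SseFrame[] {
  const frames: SseFrame[] = [];
  for (const raw of buffer.split('\n\n')) {
    if (!raw.trim()) continue;
    let event = 'message';            // the SSE default event type
    const dataLines: string[] = [];
    for (const line of raw.split('\n')) {
      if (line.startsWith('event:')) event = line.slice(6).trim();
      else if (line.startsWith('data:')) dataLines.push(line.slice(5).trimStart());
    }
    frames.push({ event, data: dataLines.join('\n') });
  }
  return frames;
}
```

A real client would additionally buffer partial frames across network chunks, since a frame can be split mid-line by the transport.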
API Gateway / Edge Layer
Terminates the client connection, authenticates the request (JWT validation), rate-limits per user, and proxies the stream from the LLM provider to the client. Must not buffer the full response — it must forward bytes as they arrive.
Alternatives: Cloudflare Workers, AWS API Gateway + Lambda Streaming, nginx with proxy_buffering off
LLM Provider (Streaming API)
The LLM API that supports streaming. All major providers (Anthropic, OpenAI, Google) support server-sent events streaming. Streams delta chunks, each containing 1-5 tokens. For Anthropic, `stream=True` yields `content_block_delta` events carrying `text_delta` payloads.
Alternatives: gpt-4o, gemini-2-flash, Mistral Large, Llama 3.1 via Together AI
Stream Transformer
Converts the LLM provider's native stream format (Anthropic SSE, OpenAI chunks) into the format expected by the client. Handles token accumulation, partial JSON detection (for structured output streaming), and tool-use event routing.
Alternatives: Custom TransformStream (Web Streams API), openai.beta.chat.completions.stream()
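A sketch of such a transformer using the Web Streams API. It assumes upstream chunks shaped like Anthropic's `content_block_delta` events (simplified here) and emits plain text tokens:

```typescript
// Map provider delta events (simplified Anthropic shape) to plain text tokens.
interface DeltaEvent { type: string; delta?: { type: string; text?: string } }

const textExtractor = () =>
  new TransformStream<DeltaEvent, string>({
    transform(event, controller) {
      // Forward only text deltas; ignore message_start, pings, etc.
      if (event.type === 'content_block_delta' && event.delta?.type === 'text_delta') {
        controller.enqueue(event.delta.text ?? '');
      }
    },
  });
```

Because `TransformStream` participates in back-pressure, a slow client downstream automatically slows reads from the provider upstream.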
Back-Pressure Controller
Prevents memory overflow when the client can't consume tokens as fast as the LLM produces them. Web Streams API's native back-pressure propagation handles this automatically when using ReadableStream. For WebSocket: implement a token bucket with max buffer size before pausing reads.
Alternatives: Custom token bucket rate limiter, WebSocket flow control with ACK messages
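The pull-based half of this can be sketched with `ReadableStream`: the underlying source is only read when the consumer asks for more, so a slow client naturally pauses upstream reads. Here `nextToken` is a stand-in for the provider's reader:

```typescript
// Pull-based stream: pull() fires only when the consumer requests a chunk,
// so upstream is never read much faster than the client drains it.
function pullDrivenStream(nextToken: () => string | null): ReadableStream<string> {
  return new ReadableStream<string>({
    pull(controller) {
      const token = nextToken();      // read upstream on demand only
      if (token === null) controller.close();
      else controller.enqueue(token);
    },
  });
}
```

With the default queuing strategy the stream reads at most one chunk ahead of the consumer, which is exactly the behavior that keeps a slow client from forcing the edge function to buffer the whole response.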
Mid-Stream Error Recovery
Handles errors that occur after streaming has started: LLM provider timeouts, rate limits hit mid-stream, network drops. Strategy: send an error SSE event with the partial completion and a retry token, allowing the client to resume from the last received chunk.
Alternatives: Client-side retry with last-event-id header (SSE native), Checkpoint-based resumption
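One way to make that resumable is to attach an `id:` field to every chunk; on reconnect the browser sends the last seen id back as the `Last-Event-ID` header. A sketch of the server-side framing (the error payload's field names are illustrative):

```typescript
// Frame a data chunk with an id so EventSource can resume via Last-Event-ID.
function sseChunk(id: number, text: string): string {
  return `id: ${id}\ndata: ${JSON.stringify({ text })}\n\n`;
}

// Frame a mid-stream error carrying the last delivered chunk id as a retry token.
function sseError(lastId: number, code: string): string {
  const payload = JSON.stringify({ code, resumeFrom: lastId });
  return `event: stream-error\nid: ${lastId}\ndata: ${payload}\n\n`;
}
```

On resume, the server can skip tokens up to `Last-Event-ID` if the completion was cached, or restart generation if it was not.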
Connection State Manager
Tracks active streaming connections for rate limiting, billing, and debugging. At scale (>1K concurrent), serverless functions cannot hold connection state — this must be offloaded to a stateful layer (Redis, Durable Objects, or a dedicated WebSocket server).
Alternatives: Cloudflare Durable Objects, Ably (managed WebSocket), Socket.io with Redis adapter
Partial JSON Parser
When streaming structured outputs (JSON mode), tokens arrive mid-object — e.g., `{"name": "Joh`. A partial JSON parser renders the object progressively as it becomes parseable. Critical for streaming tool-call arguments and structured data.
Alternatives: @anthropic-ai/sdk streaming with tool_use events, json-stream-parse
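A best-effort sketch of the idea: scan the prefix, close any open strings and brackets, and attempt a parse. Production parsers handle more edge cases, such as truncated literals like `tru`:

```typescript
// Try to make a truncated JSON prefix parseable by closing open structures.
// Returns undefined when no valid completion is found.
function completePartialJson(prefix: string): unknown {
  const closers: string[] = [];
  let inString = false;
  let escaped = false;
  for (const ch of prefix) {
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === '\\') escaped = true;
      else if (ch === '"') inString = false;
    } else if (ch === '"') inString = true;
    else if (ch === '{') closers.push('}');
    else if (ch === '[') closers.push(']');
    else if (ch === '}' || ch === ']') closers.pop();
  }
  let candidate = prefix;
  if (inString) candidate += '"';                    // close an open string
  candidate = candidate.replace(/,\s*$/, '');        // drop a dangling comma
  if (/:\s*$/.test(candidate)) candidate += 'null';  // key seen, value not started
  candidate += closers.reverse().join('');           // close open objects/arrays
  try { return JSON.parse(candidate); } catch { return undefined; }
}
```

The UI can call this on every accumulated chunk and render whichever fields are already present, e.g. showing `{"name": "Joh"}` while the rest of the object streams in.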
Rendered Stream Output
The progressive rendering of LLM output in the UI. For text: append tokens to a string and re-render. For structured data: progressively populate UI components as JSON fields become available. Key metrics: TTFT (time-to-first-token) and tokens-per-second render rate.
Alternatives: Markdown renderer with streaming, Code block with syntax highlighting, Structured form auto-fill
The stack
Vercel AI SDK handles SSE reconnection, state management (messages array), streaming text rendering, and error states out of the box. `useChat` reduces custom streaming client code from ~400 lines to ~10 lines. Works with any AI backend, not just Vercel.
Alternatives: Native EventSource API, TanStack Query + custom SSE hook, React Server Components (partial)
Vercel Edge Runtime has <50ms cold start (vs 200-800ms for Lambda) and supports Response with ReadableStream for native streaming. 130+ edge locations globally reduce latency by 40-80% vs a single-region server. Cost: $0.60/M edge function invocations. Critical: do NOT use Node.js runtime for streaming — it buffers responses.
Alternatives: Cloudflare Workers, AWS Lambda Response Streaming (Node 20), Fly.io long-running processes
SSE is simpler: browser-native, auto-reconnect, firewall-friendly (port 80/443), and works through HTTP proxies. WebSocket is required when you need bidirectional communication (client sends data mid-stream), or when streaming to mobile apps where EventSource is not natively available. 90% of LLM chat UIs should use SSE.
Alternatives: WebSocket via Socket.io, HTTP/2 server push (limited browser support), Long polling (fallback for proxies)
Claude Sonnet 4 streams at 60-100 tokens/second, fast enough for smooth UX. Gemini 2.0 Flash is fastest (150-200 tok/s) for latency-critical use cases. TTFT across providers: Claude 150-400ms, GPT-4o 200-500ms, Gemini 100-250ms. For fastest TTFT, use Groq (Llama 3.1 70B at 280 tok/s with 80ms TTFT).
Alternatives: OpenAI GPT-4o with stream=True, Google Gemini 2.0 Flash (150 tok/s), Together AI (open-source models, lower TTFT)
Web Streams API propagates back-pressure automatically through the chain: if the client isn't reading, the ReadableStream controller's desiredSize drops to 0 and the upstream reader pauses. This prevents the edge function from buffering the full LLM response in memory when a slow client can't keep up.
Alternatives: Node.js stream.pipeline(), Custom buffer with pause()/resume()
Serverless functions are stateless — at 10K concurrent streams, you need shared state for rate limiting and billing. Upstash Redis tracks connection counts per user ($0.20/100K commands). Cloudflare Durable Objects provide per-connection persistent state with geographic affinity for <20ms reads. Managed solutions (Ably) cost $0.0006/message but eliminate all infra work.
Alternatives: Ably (managed WebSocket gateway), Pusher Channels, Socket.io with Redis adapter on Fly.io
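A sketch of the per-user counting logic, with an in-memory `Map` standing in for Redis `INCR`/`DECR`. In production the counter must live in shared storage, since each serverless instance has its own memory:

```typescript
// Per-user concurrent-stream limiter (in-memory stand-in for Redis INCR/DECR).
class ConnectionLimiter {
  private counts = new Map<string, number>();
  constructor(private readonly maxPerUser: number) {}

  // Returns false when the user already holds maxPerUser open streams.
  acquire(userId: string): boolean {
    const n = this.counts.get(userId) ?? 0;
    if (n >= this.maxPerUser) return false;
    this.counts.set(userId, n + 1);
    return true;
  }

  // Call from the stream's close/abort handler so slots are always returned.
  release(userId: string): void {
    const n = this.counts.get(userId) ?? 0;
    if (n > 0) this.counts.set(userId, n - 1);
  }
}
```

The Redis version should also attach a TTL to each slot, so that a crashed function cannot leak a user's quota forever.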
Cost at each scale
Prototype: 5,000 streaming sessions/mo, avg 500 tokens output → $35/mo
Growth: 100,000 streaming sessions/mo, avg 1,000 tokens output → $870/mo
Scale: 2M streaming sessions/mo, 10K peak concurrent connections → $18,000/mo
Failure modes & guardrails
Orphaned streams after client disconnect: when the user closes the browser tab, the server keeps consuming LLM tokens (and paying for them) until the response ends. Mitigation: listen for the request's `AbortSignal` in the route handler (`req.signal.addEventListener('abort', ...)`) and cancel the LLM API call as soon as the client disconnects. For interactive use cases this can save 20-40% of LLM costs.
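A sketch of the pattern, with `fakeLlmStream` standing in for the provider's token stream:

```typescript
// Stop consuming upstream tokens as soon as the request's AbortSignal fires.
async function* fakeLlmStream(signal: AbortSignal): AsyncGenerator<string> {
  for (const token of ['Hello', ' ', 'world']) {
    if (signal.aborted) return;   // client gone: stop generating (and paying)
    yield token;
  }
}

// Route-handler-side consumption loop; in a real handler each token would be
// enqueued into the response's ReadableStream instead of collected.
async function consume(signal: AbortSignal): Promise<string[]> {
  const received: string[] = [];
  for await (const token of fakeLlmStream(signal)) received.push(token);
  return received;
}
```

With a real provider SDK, also forward the signal into the API call itself (most fetch-based clients accept one), so the upstream HTTP request is torn down too rather than merely ignored.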
Buffering middleware: middleware that buffers the full stream response (logging middleware, compression middleware) negates streaming and causes OOM on long responses. Mitigation: audit every middleware in the request pipeline and make sure the stream passes through unbuffered. Disable gzip compression for SSE routes; SSE works fine as uncompressed plain text, and compression middleware often waits for the full body before flushing.
Mid-stream rate limits: the LLM provider returns a 429 after the stream has started, by which point the 200 OK status has already been sent to the client. Mitigation: send an application-level SSE error event (e.g. `event: stream-error`, `data: {code: 'rate_limit', retryAfter: 60}`) and close the stream. The client's handler for that event should display a rate-limit message and offer a retry button.
Proxy buffering breaks SSE: corporate firewalls, proxies (nginx), and some CDNs buffer SSE responses until they reach a certain size or the connection closes. Mitigation: for nginx, set `proxy_buffering off` or send an `X-Accel-Buffering: no` response header. For Cloudflare, the `Cache-Control: no-cache` header bypasses buffering. Always test streaming through your full network stack, not just localhost.
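A minimal nginx location block for proxying SSE without buffering might look like this (the upstream name and path are illustrative):

```nginx
location /api/chat {
    proxy_pass http://app_upstream;   # illustrative upstream name
    proxy_http_version 1.1;
    proxy_set_header Connection '';   # keep the upstream connection open
    proxy_buffering off;              # forward bytes as they arrive
    proxy_cache off;
    proxy_read_timeout 300s;          # allow long-lived streams
}
```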
Frequently asked questions
How do I implement streaming in a Next.js 16 App Router route handler?
Use the edge runtime with a ReadableStream response: `export const runtime = 'edge'`. In the route handler, create an Anthropic or OpenAI streaming client, pipe the stream through the Vercel AI SDK's `toAIStream()` helper, and return `new StreamingTextResponse(stream)`. The Vercel AI SDK handles SSE formatting, heartbeats, and error serialization automatically. Don't forget to set `Content-Type: text/event-stream` and `Cache-Control: no-cache` headers.
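A dependency-free sketch of the shape of such a handler, with `tokenSource` standing in for a real provider stream (the Vercel AI SDK helpers mentioned above replace most of this body):

```typescript
// Sketch of app/api/chat/route.ts. tokenSource is a stand-in for a real
// provider stream such as an Anthropic or OpenAI streaming client.
export const runtime = 'edge';

async function* tokenSource(): AsyncGenerator<string> {
  yield* ['Hello', ', ', 'world'];
}

export async function POST(_req: Request): Promise<Response> {
  const encoder = new TextEncoder();
  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      // Emit one SSE frame per token; a real handler would also watch
      // _req.signal and cancel the provider call on client disconnect.
      for await (const token of tokenSource()) {
        controller.enqueue(encoder.encode(`data: ${JSON.stringify({ token })}\n\n`));
      }
      controller.enqueue(encoder.encode('data: [DONE]\n\n'));
      controller.close();
    },
  });
  return new Response(stream, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      Connection: 'keep-alive',
    },
  });
}
```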
How do I handle errors that happen mid-stream?
Mid-stream errors are tricky because the HTTP 200 status has already been sent. The standard pattern: (1) catch the error in your stream transformer, (2) send an application-level SSE error event, e.g. `event: stream-error\ndata: {message: 'generation failed', code: 'provider_error'}\n\n`, (3) close the stream. On the client, add `eventSource.addEventListener('stream-error', ...)`; avoid naming the custom event `error`, since that name collides with EventSource's built-in connection-error event. The handler should display the error and re-enable the submit button.
What's the maximum number of concurrent streaming connections I can handle on Vercel?
Vercel Edge Functions have no explicit concurrency limit, but each function execution has a 30-second limit and uses memory proportional to the response size in flight. In practice, 1,000-5,000 concurrent streams work fine on Vercel Pro. Above 10,000, you'll hit Vercel's function concurrency quotas (default: 1,000 concurrent executions per deployment). Solution: upgrade to Vercel Enterprise, or offload connections to a dedicated WebSocket server (Ably, Fly.io) that maintains long-lived connections independently.
Should I stream tool-use / function calling results?
Yes for better UX, but it's more complex. Anthropic's streaming API sends `input_json_delta` events while the model is building tool call arguments — you can render a 'Calling search tool...' indicator immediately. After the tool executes, stream the model's synthesis of results. The Vercel AI SDK's `streamObject()` and the `experimental_toolCallStreaming` option handle this pattern. For OpenAI, parse the `delta.tool_calls` chunks manually.
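Manually accumulating those chunks can be sketched as follows. The delta shape is simplified from OpenAI's streaming response; real deltas also carry `id` and `type` fields:

```typescript
// Accumulate OpenAI-style tool_call deltas: name and argument fragments
// arrive across many chunks, keyed by the tool call's index.
interface ToolCallDelta {
  index: number;
  function?: { name?: string; arguments?: string };
}

function accumulateToolCalls(
  deltas: ToolCallDelta[],
): { name: string; arguments: string }[] {
  const calls: { name: string; arguments: string }[] = [];
  for (const d of deltas) {
    const call = (calls[d.index] ??= { name: '', arguments: '' });
    if (d.function?.name) call.name += d.function.name;
    if (d.function?.arguments) call.arguments += d.function.arguments;
  }
  return calls;
}
```

Combined with a partial JSON parser, the accumulated `arguments` string can be rendered progressively before the tool call is complete.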