
Streaming LLM Responses: Reducing Perceived Latency (2026)

Quick Answer

Streaming uses Server-Sent Events (SSE) to push tokens to the client as the model generates them. Instead of waiting 5 seconds for a complete response, users see tokens appearing within 300ms. Streaming is essential for any UX requiring long-form output. The tradeoff: streaming complicates error handling and requires the client to handle partial responses.

When to Use

  • Any user-facing chat or generation interface where output takes more than 1 second to generate
  • Long-form content generation (documents, code, reports) where the user benefits from seeing progress
  • Agentic workflows where you want to stream the agent's reasoning steps as they occur
  • APIs serving interactive frontends where perceived latency directly impacts user satisfaction
  • Cost monitoring pipelines that need token counts in real-time rather than after completion

How It Works

  1. Enable streaming in the API call: stream=True (OpenAI) or client.messages.stream() (Anthropic). The API returns a generator/stream object instead of a complete response.
  2. The response arrives as a series of events, each containing a small delta of new text. Accumulate deltas into a buffer to reconstruct the final complete response.
  3. For tool use with streaming, the model first streams its reasoning text, then emits a structured tool_use block (whose JSON input may itself arrive as deltas to accumulate). Pause the streaming display, execute the tool, then resume streaming the response.
  4. Implement proper stream error handling: connections can drop mid-stream. Detect incomplete streams and either retry or present the partial result with a clear indication of incompleteness.
  5. For Next.js/Vercel deployments, use the AI SDK's streamText() with the Response object — it handles SSE properly in edge and serverless functions where raw streaming can be tricky.
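Steps 1–2 boil down to a simple consume-and-accumulate loop. The sketch below uses a fake generator in place of the real API stream (which would come from stream=True or client.messages.stream()), so it runs without a key; the accumulation logic is the same either way.

```python
# Minimal sketch of steps 1-2: consume a stream of text deltas and
# accumulate them into a buffer. fake_stream stands in for the generator
# the API returns when streaming is enabled.
def fake_stream():
    for delta in ["Streaming ", "keeps users ", "engaged."]:
        yield delta  # each event carries a small chunk of new text

def consume(stream):
    buffer = []
    for delta in stream:
        print(delta, end="", flush=True)  # render tokens as they arrive
        buffer.append(delta)
    return "".join(buffer)  # the final complete response

full_text = consume(fake_stream())
```

The same loop shape works for both SDKs; only the attribute you read the delta from differs.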

Examples

Anthropic streaming with Python
import anthropic

client = anthropic.Anthropic()

# Method 1: Context manager (recommended)
with client.messages.stream(
    model='claude-3-5-sonnet-20241022',
    max_tokens=1024,
    messages=[{'role': 'user', 'content': 'Write a detailed analysis of LLM pricing trends.'}]
) as stream:
    for text in stream.text_stream:
        print(text, end='', flush=True)
    
    # Get final usage stats after stream completes
    final_message = stream.get_final_message()
    print(f'\n\nInput tokens: {final_message.usage.input_tokens}')
    print(f'Output tokens: {final_message.usage.output_tokens}')
Output: Tokens stream to stdout as generated. Time to first token: ~200-400ms. The stream context manager handles connection cleanup and returns final usage stats after completion.
Next.js streaming with Vercel AI SDK
// app/api/chat/route.ts
import { streamText } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'

export async function POST(req: Request) {
  const { messages } = await req.json()
  
  const result = streamText({
    model: anthropic('claude-3-5-sonnet-20241022'),
    messages,
    system: 'You are a helpful AI assistant.',
  })
  
  return result.toDataStreamResponse()
}

// app/chat/page.tsx (client component)
'use client'
import { useChat } from 'ai/react'

export default function Chat() {
  const { messages, input, handleInputChange, handleSubmit } = useChat()
  return (
    <div>
      {messages.map(m => <div key={m.id}>{m.content}</div>)}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} />
        <button type='submit'>Send</button>
      </form>
    </div>
  )
}
Output: Full streaming chat in ~20 lines. useChat handles SSE parsing, message accumulation, and state updates automatically. Works with Vercel edge functions for minimal latency.

Common Mistakes

  • Not handling stream interruption — network issues can terminate a stream mid-response. Always implement a fallback: catch the interrupted stream error, optionally retry, and return whatever partial content was received with a 'response was cut off' indicator.
  • Streaming tool use responses — tool calls within a streamed response need special handling. The tool_use block arrives as a complete event, not as streaming deltas. Pause your streaming display when a tool_use event arrives, execute the tool, then resume.
  • Blocking SSE with server-side middleware — some authentication or logging middleware reads the full response body before passing it to the client, destroying the streaming behavior. Ensure middleware is stream-compatible.
  • Not implementing loading indicators — even with streaming, there's a 200–400ms gap before the first token. Show a loading indicator or skeleton state before the first token arrives to avoid a flash of empty content.
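The first mistake above has a standard fix: wrap stream consumption in try/except, keep whatever arrived, and flag the result as truncated. This is a hedged sketch; ConnectionError stands in for whatever exception your HTTP client actually raises on a dropped connection.

```python
# Interruption fallback: accumulate deltas, catch a mid-stream failure,
# and return the partial text plus a truncation flag for the caller.
def flaky_stream():
    yield "The analysis shows "
    yield "three trends: "
    raise ConnectionError("connection dropped mid-stream")

def consume_with_fallback(stream):
    buffer = []
    truncated = False
    try:
        for delta in stream:
            buffer.append(delta)
    except ConnectionError:
        truncated = True  # caller can retry or show a "cut off" indicator
    return "".join(buffer), truncated

text, truncated = consume_with_fallback(flaky_stream())
if truncated:
    text += "[response was cut off]"
```

The caller decides whether to retry or surface the partial result; either way the user never sees a silent failure.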

FAQ

Does streaming reduce cost?

Streaming has no effect on token cost — you're billed for the same tokens regardless of streaming. However, streaming reduces perceived latency significantly, which can reduce user abandonment (which has indirect cost implications). The token count is identical for streamed vs. non-streamed responses.

How do I implement streaming in a serverless function?

Use the AI SDK's toDataStreamResponse() or the native Response object with a ReadableStream. Vercel Edge Functions support streaming natively. AWS Lambda requires Response Streaming mode (available since 2023). Standard serverless functions that return a complete response object don't support streaming.

Can I cancel a stream mid-generation?

Yes — in the Anthropic SDK, call stream.abort(). In OpenAI, cancel the underlying HTTPS request. In the browser, use AbortController. Canceling stops generation and billing — you're only billed for tokens already generated when you cancel.
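In client code, cancellation often amounts to just leaving the consumption loop: with the Anthropic Python SDK, exiting the with-block tears down the connection. The generator below is a simulated stand-in so the control flow is visible without an API call.

```python
# Client-side cancellation sketch: stop consuming when a condition is met
# (e.g. the user clicked "stop"). With a real SDK stream, breaking out and
# exiting the context manager closes the HTTP connection.
def token_stream():
    for i in range(1000):
        yield f"token{i} "

consumed = []
for text in token_stream():
    consumed.append(text)
    if len(consumed) >= 5:  # cancellation condition
        break               # abandoning the loop abandons the stream
```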

How do I handle streaming with tool calls?

Claude and OpenAI emit tool call events differently in streaming mode. In Claude streaming, tool_use blocks come as input_json_delta events that you accumulate. In OpenAI, function calls come as delta.tool_calls. The pattern: accumulate the tool call JSON, execute after the full tool_use block is received, then send the tool result back in a follow-up request to continue generation.
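The accumulate-then-execute pattern looks like this in Python. The event dicts are simplified stand-ins for the SDK's raw event objects, assuming Anthropic-style naming (input_json_delta fragments followed by content_block_stop).

```python
import json

# Accumulate a streamed tool call: input_json_delta events carry fragments
# of the arguments JSON; content_block_stop signals the block is complete
# and the tool can be executed.
def accumulate_tool_input(events):
    partial = ""
    for event in events:
        if event["type"] == "content_block_delta" and event["delta"]["type"] == "input_json_delta":
            partial += event["delta"]["partial_json"]
        elif event["type"] == "content_block_stop":
            return json.loads(partial)  # full arguments: now execute the tool
    return None  # stream ended without completing the block

events = [
    {"type": "content_block_delta", "delta": {"type": "input_json_delta", "partial_json": '{"city": '}},
    {"type": "content_block_delta", "delta": {"type": "input_json_delta", "partial_json": '"Paris"}'}},
    {"type": "content_block_stop"},
]
args = accumulate_tool_input(events)  # {'city': 'Paris'}
```

Never call json.loads on a partial fragment — individual deltas are rarely valid JSON on their own.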

What's the difference between streaming and WebSockets for LLM responses?

SSE (Server-Sent Events) is one-directional and the standard for LLM streaming — simpler to implement, works over standard HTTP, and perfectly suited for unidirectional model output. WebSockets are bidirectional and needed for features like voice (continuous audio input while streaming text output) or real-time multi-user collaboration. Use SSE for standard chat, WebSockets for voice or collaborative applications.
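Part of why SSE is the default is how simple the wire format is: each event is a "data: ..." line followed by a blank line, over plain HTTP. This toy parser shows the shape; real clients should use a library (or the AI SDK's built-in parsing) rather than hand-rolling it.

```python
# Minimal SSE parsing sketch: extract payloads from "data: ..." lines.
# "[DONE]" is the OpenAI-style end-of-stream sentinel (an assumption here;
# other providers signal completion differently).
def parse_sse(raw: str):
    events = []
    for line in raw.splitlines():
        if line.startswith("data: "):
            payload = line[len("data: "):]
            if payload != "[DONE]":
                events.append(payload)
    return events

chunks = parse_sse("data: Hel\n\ndata: lo\n\ndata: [DONE]\n\n")  # ['Hel', 'lo']
```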
