
LLM Latency Optimization: Faster AI Responses (2026)

Quick Answer

LLM latency has two parts: TTFT (processing input before first token) and total generation time (proportional to output length). The fastest technique: streaming hides total latency. The cheapest: use a smaller/faster model for latency-sensitive tasks. The most impactful for agents: prompt caching reduces TTFT by 85% on cache hits. Start with streaming (always), then optimize TTFT with caching, then reduce output length, then consider model substitution.

When to Use

  • User-facing features where response latency directly impacts user satisfaction (target: P95 < 3 seconds)
  • Voice applications requiring sub-second TTFT for natural conversation flow
  • Agent systems where 5+ sequential LLM calls compound latency into minutes
  • Competitive features where users compare your AI speed to alternative products
  • After initial deployment when user feedback identifies latency as a pain point

How It Works

  1. Streaming: always implement streaming for user-facing text generation. Perceived latency is TTFT, not total generation time — a response that takes 5 seconds to finish feels instant if the first token appears within 200ms. Streaming turns a 5-second wait into a 0.2s TTFT experience.
  2. Model selection: smaller models are faster. Claude Haiku generates 3-4x faster than Claude Sonnet at 4x lower cost. For latency-sensitive features, benchmark all candidate models on your actual prompt distribution — theoretical speed ≠ actual speed on your prompts.
  3. Prompt caching: reduces TTFT by 85% on cache hits. For agents with large system prompts making repeated calls, caching converts an 800ms TTFT into a 120ms TTFT on cache hits. This is often the highest-impact latency optimization for agents.
  4. Reduce output tokens: generation time is proportional to output token count. Instructing the model to be concise and setting an appropriate max_tokens limit reduces generation time. Cutting a 500-token output to 200 tokens saves 60% of the generation time.
  5. Parallel calls: for agent tasks requiring multiple independent LLM calls, run them in parallel with asyncio.gather(). Five sequential 1-second calls take 5 seconds; five parallel calls take 1 second. Identify all independent calls and parallelize them.
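Step 3 can be sketched as a request builder. This is a minimal sketch assuming the anthropic SDK's messages.create interface with cache_control blocks; the function name, placeholder system-prompt text, and default model ID are illustrative, not taken from the article.

```python
# Sketch: marking a large static system prompt as cacheable so repeated agent
# calls hit the prompt cache and skip re-processing it (lower TTFT).
STATIC_SYSTEM_PROMPT = "You are a support agent. <large static context here>"  # placeholder

def build_cached_request(question: str,
                         model: str = "claude-3-5-haiku-20241022") -> dict:
    """Kwargs for client.messages.create() with the system prompt marked cacheable."""
    return {
        "model": model,
        "max_tokens": 300,
        # cache_control flags this prefix for reuse: later calls sharing the
        # exact same prefix are cache hits and pay only the reduced TTFT.
        "system": [{
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": [{"role": "user", "content": question}],
    }

# Usage (requires ANTHROPIC_API_KEY):
#   client = anthropic.Anthropic()
#   response = client.messages.create(**build_cached_request("Reset my password"))
```

The static prefix must be byte-identical across calls — any variation in the system prompt invalidates the cache.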

Examples

Latency benchmarking
import time
import statistics
from anthropic import Anthropic

client = Anthropic()

def percentile(values: list, p: float) -> float:
    # Clamp the index so small or uneven sample sizes never go out of range
    s = sorted(values)
    return s[min(int(p * len(s)), len(s) - 1)]

def benchmark_latency(model: str, prompt: str, n: int = 20) -> dict:
    ttft_times = []
    total_times = []
    
    for _ in range(n):
        start = time.perf_counter()  # monotonic clock, safe for interval timing
        first_token_time = None
        
        with client.messages.stream(
            model=model, max_tokens=200,
            messages=[{'role': 'user', 'content': prompt}]
        ) as stream:
            for text in stream.text_stream:
                if first_token_time is None:
                    first_token_time = time.perf_counter()
                    ttft_times.append(first_token_time - start)
        
        total_times.append(time.perf_counter() - start)
    
    return {
        'model': model,
        'ttft_p50': statistics.median(ttft_times),
        'ttft_p95': percentile(ttft_times, 0.95),
        'total_p50': statistics.median(total_times),
        'total_p95': percentile(total_times, 0.95)
    }
Output: Run this benchmark against your actual prompts. Typical results: Claude Haiku TTFT P50: 200ms, P95: 500ms. Claude Sonnet TTFT P50: 400ms, P95: 900ms. GPT-4o TTFT P50: 500ms, P95: 1200ms. Your results will vary based on prompt length and region.
Parallel independent LLM calls
import asyncio
import anthropic

client = anthropic.AsyncAnthropic()

async def parallel_analysis(document: str) -> dict:
    # These 3 analyses are independent — run in parallel
    async def summarize():
        r = await client.messages.create(
            model='claude-3-5-haiku-20241022', max_tokens=200,
            messages=[{'role': 'user', 'content': f'Summarize in 2 sentences: {document}'}]
        )
        return r.content[0].text
    
    async def extract_entities():
        r = await client.messages.create(
            model='claude-3-5-haiku-20241022', max_tokens=200,
            messages=[{'role': 'user', 'content': f'Extract key entities as JSON: {document}'}]
        )
        return r.content[0].text
    
    async def classify_sentiment():
        r = await client.messages.create(
            model='claude-3-5-haiku-20241022', max_tokens=10,
            messages=[{'role': 'user', 'content': f'Sentiment (positive/negative/neutral): {document}'}]
        )
        return r.content[0].text
    
    summary, entities, sentiment = await asyncio.gather(
        summarize(), extract_entities(), classify_sentiment()
    )
    return {'summary': summary, 'entities': entities, 'sentiment': sentiment}
Output: 3 parallel calls vs. sequential: 300ms vs. 900ms (3x faster). All three use the fast Haiku model. The anthropic.AsyncAnthropic client is required for concurrent calls.

Common Mistakes

  • Measuring average latency instead of P95 or P99 — users experience worst-case latency, not average. A P50 of 500ms with a P99 of 8000ms creates a terrible user experience even though the average looks acceptable. Always report P95 and P99.
  • Not streaming for user-facing text — users perceive streaming as much faster even when total time-to-complete is identical. Always implement streaming for any text generation the user reads directly.
  • Optimizing TTFT without measuring — TTFT is affected by: model size, input token count, server load, geographic proximity to API servers. Measure each factor separately to identify the bottleneck before optimizing.
  • Ignoring network latency — for applications deployed globally, using a US-based API endpoint adds 100-200ms for European users and 250-400ms for Asian users. Deploy with regional API endpoints or a CDN layer when serving users in multiple geographies.

FAQ

What's the fastest LLM available in 2026?

For TTFT: Groq-hosted Llama 3.1 70B achieves sub-100ms TTFT and 800+ tokens/sec generation speed — the fastest inference available. For production API with quality: Claude Haiku is the fastest Anthropic model (~200ms TTFT), Gemini 1.5 Flash is the fastest Google model. For self-hosted: Ollama + Mistral 7B on a modern GPU achieves ~300ms TTFT locally.

How does output token count affect total latency?

Most models generate 50-150 tokens/second. A 100-token response takes 0.7-2 seconds to generate after TTFT. A 500-token response takes 3-10 seconds. Reducing output length from 500 to 100 tokens saves 2-8 seconds of generation time. For latency-sensitive features, constrain output length aggressively.
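The arithmetic above can be captured in a quick estimator. This is a sketch only: generation_time_range is a hypothetical helper, and the 50-150 tokens/sec defaults are the illustrative range quoted above, not measured values.

```python
# Back-of-envelope post-TTFT generation time at the slow and fast ends of the
# 50-150 tokens/sec range quoted above.
def generation_time_range(output_tokens: int,
                          slow_tps: float = 50.0,
                          fast_tps: float = 150.0) -> tuple:
    """Return (best_case_seconds, worst_case_seconds) of generation time."""
    return (output_tokens / fast_tps, output_tokens / slow_tps)

best, worst = generation_time_range(500)   # roughly 3.3s to 10s
# Shrinking 500 -> 100 output tokens saves up to 8s in the worst case:
saved = generation_time_range(500)[1] - generation_time_range(100)[1]
```

Substitute your own measured tokens/sec for the defaults once you have benchmark data.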

Does geographic proximity to API servers matter?

Yes — significantly. AWS us-east-1 to Anthropic API: ~20ms network. EU-west to same: ~100ms. Asia-Pacific: ~200ms. For TTFT of 300ms, network adds 7% (US) vs. 33% (EU) vs. 67% (APAC). Anthropic and OpenAI have regional endpoints; use them for your user geography.

What's speculative decoding and should I use it?

Speculative decoding uses a small draft model to predict tokens, then verifies with the main model in parallel. It can double generation speed with no quality loss. It's implemented at the serving infrastructure level (not the API level). If you're self-hosting models, speculative decoding is highly recommended. Hosted APIs apply it internally — you benefit without configuration.

How do I reduce latency in multi-step agent pipelines?

Three strategies: (1) Parallelize independent steps (asyncio.gather). (2) Cache intermediate results that repeat across calls (prompt caching for static context, application cache for dynamic data that's reused). (3) Use faster models for simple steps and reserve slow frontier models for complex reasoning steps. A 10-step agent making all calls sequentially on slow models can take 30+ seconds; with these optimizations, often under 5 seconds.
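Strategy (3) — routing each step to the cheapest adequate model — can be sketched with a simple lookup table. The step names and routing rule here are hypothetical; the Haiku model ID matches the examples above, and the frontier-model ID is a placeholder for you to substitute.

```python
# Hypothetical routing table: fast model for simple steps, slow frontier model
# reserved for the one step that needs complex reasoning.
FAST_MODEL = "claude-3-5-haiku-20241022"
FRONTIER_MODEL = "<your-frontier-model-id>"  # placeholder — substitute a real ID

STEP_MODELS = {
    "classify_intent": FAST_MODEL,
    "extract_fields": FAST_MODEL,
    "plan_actions": FRONTIER_MODEL,  # complex reasoning: worth the extra latency
}

def model_for_step(step: str) -> str:
    # Default to the fast model; only explicitly listed steps pay for the slow one.
    return STEP_MODELS.get(step, FAST_MODEL)
```

Combined with parallelizing independent steps, this keeps the slow model on the critical path for only one step instead of ten.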
