Model Routing: Sending Queries to the Right LLM (2026)
Model routing classifies each incoming query by complexity and routes it to the appropriate model tier. Simple queries (factual lookups, short classifications) go to fast cheap models (Haiku, GPT-4o-mini). Complex queries (multi-step reasoning, code generation, nuanced analysis) go to frontier models. Routing itself should use a fast cheap model or a trained classifier. Well-designed routing cuts costs 50–80% while maintaining quality on 95%+ of queries.
When to Use
- ✓ High-volume applications where the query complexity distribution is mixed (60%+ are simple)
- ✓ When cost is the primary constraint but you can't sacrifice quality on complex queries
- ✓ Multi-model architectures where different models have complementary strengths
- ✓ Building an AI gateway layer that serves multiple downstream applications
- ✓ When you want to route to different providers based on cost, latency, or capability
How It Works
1. Complexity classification: use a fast cheap model (or a fine-tuned classifier) to score incoming queries on complexity (1-5 scale) before routing. This classification call costs ~$0.001, negligible compared to routing savings.
2. Define routing tiers: Tier 1 (cheap/fast): simple factual lookup, short classification, sentiment, extraction. Tier 2 (medium): summarization, basic Q&A, code completion. Tier 3 (frontier): multi-step reasoning, code generation, complex analysis.
3. RouteLLM (open-source, 2024) trains a router model on quality labels to predict which routing decisions achieve target quality. It achieves 40% cost reduction with 95%+ quality retention on standard benchmarks.
4. Cascade routing: try a cheap model first; if confidence is below threshold, escalate to a stronger model. This is conservative but effective: you only pay for the expensive model when the cheap one fails.
5. Monitor routing accuracy: compare routed outputs to frontier model outputs on a sample (1–5%) of traffic. If small-model quality on 'simple' queries falls below threshold, either tighten routing criteria or fine-tune the classifier.
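Steps 1 and 2 can be sketched as pure routing logic. The scoring prompt, tier cutoffs, and helper names below are illustrative assumptions; in production the complexity score would come from a fast-model call rather than a stub:

```python
# Sketch of steps 1-2: map a 1-5 complexity score to a model tier.
# The prompt template and tier table are illustrative assumptions.

SCORING_PROMPT = (
    "Rate the complexity of this query on a 1-5 scale, where 1 is a "
    "simple factual lookup and 5 is multi-step reasoning or code "
    "generation. Reply with only the number.\n\nQuery: {query}"
)

# Hypothetical tier table: score ceiling -> model ID.
# Tiers 2 and 3 both use Sonnet here as a stand-in medium/frontier pair.
TIERS = [
    (2, 'claude-3-5-haiku-20241022'),   # Tier 1: scores 1-2
    (5, 'claude-3-5-sonnet-20241022'),  # Tiers 2-3: scores 3-5
]

def parse_score(raw: str) -> int:
    """Parse the classifier model's reply, clamping to the 1-5 range."""
    try:
        score = int(raw.strip().split()[0])
    except (ValueError, IndexError):
        return 5  # Unparseable reply: route conservatively to the top tier
    return min(max(score, 1), 5)

def pick_model(score: int) -> str:
    """Map a complexity score to the first tier whose ceiling covers it."""
    for ceiling, model in TIERS:
        if score <= ceiling:
            return model
    return TIERS[-1][1]
```

Failing closed (unparseable score routes to the top tier) trades a little cost for safety; the opposite default is reasonable if you have a cascade behind the router.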
Examples
```python
from anthropic import Anthropic

client = Anthropic()

def route_query(query: str, context: dict | None = None) -> str:
    context = context or {}  # Avoid a mutable default argument
    query_lower = query.lower()
    query_len = len(query.split())
    # Heuristic routing rules
    simple_signals = [
        query_len < 15,  # Short query
        any(kw in query_lower for kw in ['what is', 'define', 'how many', 'when was', 'who is']),
        context.get('task_type') in ['classification', 'extraction', 'sentiment'],
    ]
    complex_signals = [
        query_len > 50,
        'debug' in query_lower or 'implement' in query_lower or 'architect' in query_lower,
        context.get('requires_reasoning') is True,
    ]
    if sum(complex_signals) >= 2:
        return 'claude-3-5-sonnet-20241022'
    elif sum(simple_signals) >= 2:
        return 'claude-3-5-haiku-20241022'
    else:
        return 'claude-3-5-haiku-20241022'  # Default to cheap, escalate on failure

# Use routing in your pipeline
model = route_query(user_query, context={'task_type': 'classification'})
response = client.messages.create(
    model=model,
    max_tokens=1024,
    messages=[{'role': 'user', 'content': user_query}],
)
```

Cascade routing (step 4) tries the cheap model first and escalates only when its self-assessed confidence falls below threshold:

```python
from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # Async client, since cascade_route awaits its calls

async def cascade_route(query: str, system: str, confidence_threshold: float = 0.85) -> str:
    # Try the cheap model first
    haiku_response = await client.messages.create(
        model='claude-3-5-haiku-20241022',
        max_tokens=1024,
        system=system,
        messages=[{'role': 'user', 'content': query}],
    )
    answer = haiku_response.content[0].text
    # Ask Haiku to self-assess confidence
    confidence_check = await client.messages.create(
        model='claude-3-5-haiku-20241022',
        max_tokens=10,
        messages=[{
            'role': 'user',
            'content': (
                f'On a scale 0-1, how confident are you in this answer? '
                f'Query: {query}\nAnswer: {answer}\n\nReply with only a number.'
            ),
        }],
    )
    try:
        confidence = float(confidence_check.content[0].text.strip())
    except ValueError:
        confidence = 0.0  # Unparseable self-assessment: treat as low confidence
    if confidence >= confidence_threshold:
        return answer
    # Escalate to Sonnet
    sonnet_response = await client.messages.create(
        model='claude-3-5-sonnet-20241022',
        max_tokens=1024,
        system=system,
        messages=[{'role': 'user', 'content': query}],
    )
    return sonnet_response.content[0].text
```
Common Mistakes
- ✗ Routing to cheap models without measuring quality degradation: assume the cheap model will be worse on some queries, and measure how much worse. Run 100 queries through both models and compare with LLM-as-judge. The acceptable degradation level depends on your use case.
- ✗ Overcomplicating the routing logic: a 3-tier routing system based on query length and 5 keywords often outperforms a complex ML classifier if your query distribution is predictable. Start simple, and add complexity only when simple routing has measurable quality gaps.
- ✗ Not logging routing decisions: always log which model was used for each query. This data is essential for diagnosing quality issues ('all complaints came from Haiku-routed queries') and improving routing accuracy over time.
- ✗ Ignoring provider-level routing: model routing isn't just about cheap vs. expensive; it's also about capability. Some models are better at code (Claude), others at math (Gemini), others at following complex instructions (GPT-4o). Route to the best model for each task type, not just the cheapest.
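Per the logging point above, each routing decision can be recorded as a structured event. The field names here are illustrative, not a fixed schema:

```python
import hashlib
import json
import time

def routing_log_record(query: str, model: str, signals: dict) -> str:
    """Build a JSON log line for one routing decision.

    Field names are illustrative; the key point is recording which
    model served which query so quality issues can be traced by tier.
    """
    record = {
        'ts': time.time(),
        # Hash rather than log raw queries, to keep user text out of logs
        'query_hash': hashlib.sha256(query.encode()).hexdigest()[:16],
        'query_tokens_approx': len(query.split()),
        'model': model,
        'signals': signals,  # e.g. which heuristics fired
    }
    return json.dumps(record)
```

Emitting one such line per request lets you aggregate by `model` later and answer questions like "did the complaints cluster in Haiku-routed traffic?".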
FAQ
What is RouteLLM and how does it work?
RouteLLM (Ong et al., 2024) is an open-source framework that trains a small router model to predict whether a query needs a strong or weak model. The router is trained on quality labels (did the weak model's answer match the strong model's?). On standard benchmarks, RouteLLM achieves 40% cost reduction while maintaining 95% of frontier model quality.
Can I route between providers, not just models?
Yes — provider routing makes sense when: Anthropic is faster for some queries, OpenAI is cheaper for others, or you need fallback to another provider for reliability. Services like LiteLLM and Martian's LLM router support multi-provider routing. Vercel AI Gateway provides unified routing with automatic failover.
How do I handle routing for long conversations?
Route based on the full conversation complexity, not just the latest message. A simple message in a complex conversation still needs the frontier model that has context of the full exchange. Track conversation-level complexity state: if any prior message was routed to a frontier model, route subsequent messages to the same model for consistency.
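The sticky rule described here can be sketched as: once any turn escalates, later turns stay on the stronger model. The model IDs and the assumption of a separate per-turn router are illustrative:

```python
# Illustrative model IDs; the per-turn router is assumed to exist elsewhere.
FRONTIER = 'claude-3-5-sonnet-20241022'
CHEAP = 'claude-3-5-haiku-20241022'

def route_turn(prior_models: list[str], turn_model: str) -> str:
    """Sticky conversation routing.

    If any earlier turn in this conversation was served by the frontier
    model, keep the conversation there for consistency; otherwise use
    whatever the per-turn router chose for this message.
    """
    if FRONTIER in prior_models:
        return FRONTIER
    return turn_model
```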
What's the right routing granularity?
Two or three tiers works well for most applications. More than three tiers adds routing complexity without proportional savings — the differences between adjacent tiers become smaller. Two tiers (cheap + expensive) with cascade is simple and captures 80% of the savings. Add a third tier (medium) if you have a clear price-quality point that fills a gap.
How does model routing interact with prompt caching?
They interact in an important way: if you route 70% of queries to Haiku and 30% to Sonnet, you have two separate cache pools. The Haiku cache won't help Sonnet queries and vice versa. Design your system prompt and context to be cacheable for each model tier independently. High cache hit rates benefit from routing stability — sending the same user consistently to the same model improves cache utilization.
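One way to get the routing stability described above is to pin borderline traffic deterministically by user ID, so the same user always hits the same model (and therefore the same cache pool). The hash-based split below is an illustrative sketch, not an established API:

```python
import hashlib

def stable_route(user_id: str, sonnet_fraction: float = 0.3) -> str:
    """Deterministically assign a user to a model tier.

    The same user_id always maps to the same bucket, so repeat queries
    from that user hit the same model's prompt cache. sonnet_fraction
    is the share of users pinned to the stronger model (assumed value).
    """
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = digest[0] / 255.0  # Stable pseudo-random value in [0, 1]
    if bucket < sonnet_fraction:
        return 'claude-3-5-sonnet-20241022'
    return 'claude-3-5-haiku-20241022'
```

In practice you would apply this only to queries where the complexity router is indifferent; clearly simple or clearly complex queries should still be routed on their merits.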