
A/B Model Testing: Comparing LLMs in Production (2026)

Quick Answer

A/B model testing splits production traffic between two LLM configurations (different model, prompt, or retrieval strategy) and measures which produces better outcomes — higher user satisfaction, task completion, or conversion. It's the gold standard for validating LLM changes because offline evals don't always predict real user behavior. It typically requires a minimum of 500–1,000 queries per variant to reach statistical significance.

When to Use

  • Validating a major model upgrade (e.g., switching from Claude Haiku to Claude Sonnet) on real traffic before full rollout
  • Testing a new prompt strategy that offline evals show is better, but you need production confirmation
  • Comparing two RAG architectures on real user queries with distribution you can't fully replicate in offline evals
  • Measuring user satisfaction impact of quality improvements (do users actually prefer the better-scoring model?)
  • Making cost-quality tradeoffs — testing whether a cheaper model achieves equivalent user outcomes

How It Works

  1. Randomized assignment: assign each user session or request to variant A or B using a hash function. Maintain consistency — the same user should always see the same variant within an experiment. Use a feature flag system (LaunchDarkly, Unleash) rather than manual routing.
  2. Define a primary metric before starting: task completion rate, user thumbs up/down, response acceptance rate, session length, or conversion. Secondary metrics can include cost, latency, and LLM-as-judge scores.
  3. Run the experiment long enough: calculate required sample size using power analysis. For detecting a 5% improvement with 80% power and a 5% significance level, typically 1,000–2,000 samples per variant. Use an online calculator.
  4. Monitor for novelty effect and experiment contamination: users may interact differently with a system they perceive as new. Run for at least 2 weeks to average out novelty effects.
  5. Analyze results with proper statistics: t-test for continuous metrics, chi-squared for conversion rates. Report confidence intervals, not just p-values. p < 0.05 with no meaningful effect size is still not actionable.
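The sample-size calculation in step 3 can be sketched directly rather than relying on an online calculator. A minimal power analysis for comparing two proportions via Cohen's h; the function name and the baseline/target rates are illustrative:

```python
import math
from scipy.stats import norm

def samples_per_variant(p_base: float, p_target: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    # Cohen's h: effect size for comparing two proportions
    h = 2 * (math.asin(math.sqrt(p_target)) - math.asin(math.sqrt(p_base)))
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided significance
    z_beta = norm.ppf(power)
    return math.ceil((z_alpha + z_beta) ** 2 / h ** 2)

# Detecting a lift from a 60% to a 65% thumbs-up rate at 80% power, alpha = 0.05
print(samples_per_variant(0.60, 0.65))
```

For this scenario the formula lands in the mid hundreds of samples per variant; smaller lifts drive the requirement up quickly, which is why the 1,000–2,000 range above is a safer planning default.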

Examples

A/B test routing with deterministic hashing
import hashlib
from anthropic import Anthropic

client = Anthropic()

def get_variant(user_id: str, experiment_id: str) -> str:
    # Deterministic assignment — same user always gets same variant
    hash_input = f'{user_id}:{experiment_id}'
    hash_val = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
    return 'A' if hash_val % 100 < 50 else 'B'

def llm_call_with_ab(user_id: str, query: str) -> dict:
    variant = get_variant(user_id, experiment_id='model_upgrade_2026_04')
    
    model = 'claude-3-5-haiku-20241022' if variant == 'A' else 'claude-3-5-sonnet-20241022'
    
    response = client.messages.create(
        model=model, max_tokens=1024,
        messages=[{'role': 'user', 'content': query}]
    )
    
    # Log for analysis
    log_ab_event(user_id=user_id, variant=variant, model=model,
                 query=query, response=response.content[0].text,
                 input_tokens=response.usage.input_tokens,
                 output_tokens=response.usage.output_tokens)
    
    return {'response': response.content[0].text, 'variant': variant}
Output: 50/50 deterministic split by user_id. All interactions logged with variant label for analysis. Cost and quality metrics tracked per variant automatically from usage data.
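The `log_ab_event` call above is assumed to exist. A minimal sketch that appends one JSON line per interaction; the file path and schema are illustrative, and in production this would be your event pipeline or warehouse writer:

```python
import json
import time

def log_ab_event(**fields):
    # One JSON object per line; 'ts' lets analysis window results by time
    record = {'ts': time.time(), **fields}
    with open('ab_events.jsonl', 'a') as f:
        f.write(json.dumps(record) + '\n')
```

JSONL is convenient here because each variant-labeled event can be loaded straight into a DataFrame for the analysis step below.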
Statistical analysis of A/B results
from scipy import stats
import pandas as pd

def analyze_ab_results(experiment_data: pd.DataFrame) -> dict:
    variant_a = experiment_data[experiment_data['variant'] == 'A']
    variant_b = experiment_data[experiment_data['variant'] == 'B']
    
    # Primary metric: user satisfaction (thumbs up rate)
    satisfaction_a = variant_a['thumbs_up'].mean()
    satisfaction_b = variant_b['thumbs_up'].mean()
    
    # Chi-squared test for proportions
    contingency = pd.crosstab(experiment_data['variant'], experiment_data['thumbs_up'])
    chi2, p_value, dof, _ = stats.chi2_contingency(contingency)
    
    # Cost per query in USD (per-million-token rates: Haiku $0.80/$4.00, Sonnet $3.00/$15.00)
    cost_a = (variant_a['input_tokens'] * 0.80 + variant_a['output_tokens'] * 4.00).mean() / 1e6
    cost_b = (variant_b['input_tokens'] * 3.00 + variant_b['output_tokens'] * 15.00).mean() / 1e6
    
    return {
        'satisfaction_a': satisfaction_a, 'satisfaction_b': satisfaction_b,
        'lift': (satisfaction_b - satisfaction_a) / satisfaction_a,
        'p_value': p_value, 'significant': p_value < 0.05,
        'cost_per_query_a': cost_a, 'cost_per_query_b': cost_b,
        'cost_increase': (cost_b - cost_a) / cost_a,
        'recommendation': 'deploy B' if p_value < 0.05 and satisfaction_b > satisfaction_a else 'keep A'
    }
Output: Combines a statistical significance test with cost/quality tradeoff analysis. 'Deploy B' only if the improvement is significant AND the cost increase is justified by the satisfaction lift.

Common Mistakes

  • Stopping the experiment early when results look good — peeking at results and stopping when p < 0.05 inflates false positive rates. Pre-commit to a minimum sample size and run time before analyzing.
  • Not controlling for confounders — user segments may respond differently to LLM changes. Analyze results by segment (new vs. returning users, query types, time of day). A model that performs better overall may hurt specific important segments.
  • Testing too many variants simultaneously — each additional variant reduces the sample per variant and increases the chance of false positives. Run at most 2–3 variants per experiment. Sequential testing (A/A first for calibration, then A/B) is more reliable than multi-variant tests.
  • Using implicit metrics without validating they measure what you think — 'session length' sounds good but longer sessions could mean users are confused, not satisfied. Validate metric choices with qualitative user research before using them as primary signals.
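The segment-level check from the second point above can be sketched with pandas. Column names mirror the logging example; the `segment` column and the toy data are hypothetical:

```python
import pandas as pd

def lift_by_segment(df: pd.DataFrame, segment_col: str = 'segment') -> pd.DataFrame:
    # Thumbs-up rate per (segment, variant), then B-over-A lift within each segment
    rates = df.groupby([segment_col, 'variant'])['thumbs_up'].mean().unstack('variant')
    rates['lift'] = (rates['B'] - rates['A']) / rates['A']
    return rates

# Toy data: B helps new users but hurts returning users (an overall average would hide this)
df = pd.DataFrame({
    'segment':   ['new'] * 4 + ['returning'] * 4,
    'variant':   ['A', 'A', 'B', 'B'] * 2,
    'thumbs_up': [0, 1, 1, 1, 1, 1, 0, 1],
})
print(lift_by_segment(df))
```

In the toy data the overall thumbs-up rates are identical across variants, yet the per-segment lifts point in opposite directions; that is exactly the confounding the bullet warns about.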

FAQ

How long should I run an A/B test?

Minimum: reach statistical significance on your primary metric (calculate with a power analysis calculator). Minimum time: 2 weeks to average out day-of-week and novelty effects, even if you hit significance sooner. Maximum: 8 weeks — beyond that, external factors (seasonality, product changes) confound the results.

What if my A/B test shows no significant difference?

A null result is a valid result — it means the change doesn't meaningfully affect user outcomes (positive or negative). Check if the experiment was adequately powered (enough samples). If yes, accept the null result and consider: is the quality improvement measurable by offline evals but invisible to users? If so, the offline metric may not matter.

Can I A/B test without randomized assignment?

Yes — time-based testing (run A this week, B next week) is simpler to implement but confounds results with weekly cycles and external events. Quasi-experimental methods (difference-in-differences) can partially account for this. Randomized assignment is more robust whenever possible.
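A difference-in-differences estimate, as mentioned above, is just two subtractions. The thumbs-up rates below are hypothetical figures from a staged regional rollout:

```python
def diff_in_diff(treated_pre: float, treated_post: float,
                 control_pre: float, control_post: float) -> float:
    # Change in the treated group minus the change in the control group
    return (treated_post - treated_pre) - (control_post - control_pre)

# Region 1 switches from A to B; region 2 stays on A over the same two weeks
effect = diff_in_diff(0.60, 0.66, 0.61, 0.62)
print(round(effect, 3))  # estimated lift net of the shared time trend
```

Subtracting the control group's change strips out whatever moved both regions (seasonality, a product launch), leaving a cleaner estimate of the model change itself.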

How do I handle A/B testing for multi-turn conversations?

Assign at the conversation/session level, not the message level. A user must experience the same model throughout a conversation — mixing models mid-conversation creates inconsistency. Track conversation-level outcomes (task completed, user rated conversation positively) rather than per-message metrics.

What's the minimum A/B test infrastructure I need?

Minimum viable: (1) deterministic hashing for variant assignment, (2) event logging with variant label, (3) a database or data warehouse to query results. You can build this in a day. Full-featured: feature flag system (LaunchDarkly), experiment management dashboard (Statsig, Amplitude Experiment), automated stopping rules. Start minimal and add tooling as you run more experiments.

Related