End-to-End Fine-Tuning Pipeline: From Data to Deployment
Last updated: April 16, 2026
Quick answer
Fine-tune with LoRA on Llama 3.1 8B or Mistral 7B when you need consistent output format, domain-specific vocabulary, or reduced inference costs at scale. Expect 40-60% cost reduction vs GPT-4o for equivalent task quality on narrow tasks. Full pipeline from data to deployment: 1-2 weeks, $200-2,000 for training, $50-300/mo for inference hosting.
The problem
Teams fine-tune models for the wrong reasons or with bad data, wasting $500-5,000 per training run. Common failure modes: fine-tuning on <1,000 examples and expecting prompt-level quality, training on uncleaned data with 15-30% noise, and deploying without regression testing. Meanwhile, well-crafted few-shot prompts often match fine-tuned performance at zero training cost — making the ROI calculation critical before any training run.
Architecture
Raw Training Data
Source data: production logs, human-written examples, synthetic data from GPT-4o/Claude, or existing datasets. Minimum 500 high-quality examples for LoRA; 5,000+ for full fine-tuning. Quality matters far more than quantity.
Alternatives: Hugging Face datasets, Synthetic data via GPT-4o, Production logs, Human annotation (Scale AI, Labelbox)
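Whatever the source, most fine-tuning stacks expect one example per JSONL line in a messages-style chat format. A minimal sketch (field names follow the OpenAI/TRL convention; your trainer may expect a different schema):

```python
import json

# One example per JSONL line, in the widely used "messages" chat format
# (field names follow the OpenAI/TRL convention; your trainer may differ).
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a support-ticket classifier."},
            {"role": "user", "content": "My invoice shows a double charge for March."},
            {"role": "assistant", "content": '{"intent": "billing_dispute", "priority": "high"}'},
        ]
    },
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Round-trip check: every line must parse and keep the three-role structure.
with open("train.jsonl") as f:
    rows = [json.loads(line) for line in f]
roles = [m["role"] for m in rows[0]["messages"]]
print(roles)
```

Validating the round trip up front catches malformed lines before they silently corrupt a training run.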
Data Cleaning & Deduplication
Remove duplicates (MinHash LSH), filter low-quality examples (perplexity filtering), detect and remove PII, and validate format consistency. Expect to discard 10-30% of raw data. A dirty dataset is the #1 cause of fine-tuning failure.
Alternatives: Cleanlab, Argilla, LLM-as-judge filtering
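As a simplified stand-in for MinHash LSH, near-duplicate filtering can be sketched as exact Jaccard similarity over word-trigram shingles; libraries like datasketch approximate the same comparison at scale. The 0.7 threshold and shingle size below are illustrative assumptions:

```python
# Simplified near-duplicate filter: exact Jaccard on word trigram shingles.
# Production pipelines use MinHash LSH (e.g. datasketch) to approximate this
# at scale; threshold and shingle size here are illustrative.

def shingles(text: str, n: int = 3) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def dedupe(texts, threshold: float = 0.7):
    kept, kept_shingles = [], []
    for t in texts:
        s = shingles(t)
        is_dup = any(
            len(s & ks) / len(s | ks) >= threshold for ks in kept_shingles
        )
        if not is_dup:
            kept.append(t)
            kept_shingles.append(s)
    return kept

docs = [
    "refund the customer for the duplicate march invoice charge today",
    "refund the customer for the duplicate march invoice charge now",
    "escalate the outage ticket to the on-call engineer",
]
print(dedupe(docs))  # the second doc is dropped as a near-duplicate
```

The pairwise version is O(n²); MinHash LSH buckets candidates so each example is only compared against likely matches.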
Data Formatter (Chat Template)
Converts cleaned data into the model's expected chat format: system/user/assistant turns with correct special tokens. Each base model has a unique template — mismatched templates are a silent killer of fine-tune quality.
Alternatives: LiteLLM format converter, Custom Jinja2 templates
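To make the template mismatch concrete, here is Llama 3.1's chat layout written out by hand. In practice you should call `tokenizer.apply_chat_template()` from transformers so the template always matches the base model; this pure-Python sketch is only to show the special tokens involved:

```python
# Llama 3.1's chat layout written out by hand to show the special tokens.
# In practice, use tokenizer.apply_chat_template() from transformers so the
# template always matches the base model's tokenizer config.

def format_llama31(messages) -> str:
    out = "<|begin_of_text|>"
    for m in messages:
        out += (
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content']}<|eot_id|>"
        )
    return out

msgs = [
    {"role": "system", "content": "Classify the ticket."},
    {"role": "user", "content": "Double charge on my invoice."},
    {"role": "assistant", "content": "billing_dispute"},
]
prompt = format_llama31(msgs)
print(prompt)
```

Mistral uses a completely different `[INST] ... [/INST]` layout; training Llama-formatted data against a Mistral base (or vice versa) degrades quality without throwing any error.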
Training Engine
Runs the actual LoRA or full fine-tuning job. Manages gradient checkpointing, mixed precision (bf16), batch size, learning rate scheduling, and checkpoint saving. LoRA trains adapter weights only — 10-100x fewer parameters than full fine-tuning.
Alternatives: HuggingFace TRL + PEFT, Axolotl, LLaMA-Factory, OpenAI fine-tuning API
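A hedged configuration sketch of a LoRA run with HF TRL + PEFT. The `SFTTrainer` signature changes across TRL versions, and the dataset path and hyperparameters are assumptions; treat this as a starting point, not a drop-in script:

```python
# Sketch of a LoRA fine-tune with HF TRL + PEFT (signature varies by version;
# check the current TRL docs before running).
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Assumes train.jsonl holds {"messages": [...]} chat-format records.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="out-lora",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=3,
    bf16=True,                    # mixed precision, as described above
    gradient_checkpointing=True,  # trades recompute for VRAM
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # downloads the base model
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,  # only adapter weights are trained
)
trainer.train()  # checkpoints land in out-lora/
```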
Base Model
The pre-trained model being fine-tuned. Smaller models (7-8B) are cheaper to train and host but have lower ceiling quality. Larger models (70B) approach GPT-4o quality but cost 10x more to run.
Alternatives: Mistral 7B Instruct, Llama 3.1 70B, Qwen 2.5 7B, Phi-4 (14B)
Evaluation Harness
Automated evaluation suite: task-specific metrics (ROUGE, BLEU for generation; F1/accuracy for classification), LLM-as-judge for qualitative assessment, and regression tests against a golden test set. Run evals after every checkpoint.
Alternatives: HELM, Braintrust, Promptfoo
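A minimal sketch of the regression-test layer: exact match plus token-level F1 over a tiny golden set. The LLM-as-judge rubric pass would sit alongside these metrics, not replace them; the example pairs are illustrative:

```python
# Minimal regression check against a golden test set: exact match plus
# token-level F1. The LLM-as-judge rubric pass layers on top of this.

def token_f1(pred: str, gold: str) -> float:
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(w), g.count(w)) for w in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

# (prediction, expected) pairs from the golden set -- illustrative only.
golden = [("billing_dispute", "billing_dispute"), ("refund request", "refund issued")]
preds = [p for p, _ in golden]
golds = [g for _, g in golden]

exact = sum(p == g for p, g in zip(preds, golds)) / len(golden)
f1 = sum(token_f1(p, g) for p, g in zip(preds, golds)) / len(golden)
print(exact, round(f1, 2))
```

Running this after every checkpoint turns "the model feels better" into a number you can gate deployments on.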
Model Serving Infrastructure
Hosts the fine-tuned model for inference. Options range from managed APIs (Together AI, Fireworks AI) to self-hosted vLLM on GPU instances. LoRA adapters can be merged into the base model or served dynamically with LoRAX.
Alternatives: Fireworks AI, Replicate, Modal, Hugging Face Inference Endpoints, LoRAX for multi-adapter serving
Model Registry
Stores versioned model checkpoints and adapter weights. Tracks which data version produced which model version. Critical for rollback when a new fine-tune regresses on production traffic.
Alternatives: MLflow Model Registry, W&B Artifacts, AWS S3 + DVC
Fine-Tuned Model (Production)
The deployed fine-tuned model serving production traffic. Outputs structured, domain-consistent responses. Compare cost per 1M tokens vs GPT-4o equivalent to calculate ROI.
Alternatives: Merged LoRA weights, Base model + dynamic LoRA adapter
The stack
Llama 3.1 8B Instruct is the sweet spot: strong instruction-following baseline, 128K context window, a commercial-friendly license (the Llama 3.1 Community License, which permits commercial use up to 700M monthly active users; note it is not Apache 2.0, unlike Mistral 7B and Qwen 2.5), and $0.10-0.20/M tokens on managed APIs vs $5-15/M for GPT-4o. LoRA on 8B typically reaches 90-95% of GPT-4o quality on narrow tasks.
Alternatives: Mistral 7B Instruct v0.3, Qwen 2.5 7B, Phi-4 14B, Llama 3.1 70B for quality-critical tasks
Unsloth provides 2x faster training and 50% less VRAM usage vs standard HF Trainer via custom CUDA kernels. On an A100 80GB, Unsloth trains 8B LoRA at ~3,000 tokens/second vs ~1,500 tokens/second vanilla. This halves your GPU costs.
Alternatives: Axolotl, LLaMA-Factory, OpenAI fine-tuning API (GPT-4o mini), Google Vertex AI fine-tuning
r=16 is the standard starting point — enough parameter capacity for task-specific adaptation without overfitting on <10K examples. QLoRA (4-bit NF4) enables 7B training on a single 24GB GPU (RTX 4090) at the cost of 5-10% quality degradation vs full-precision LoRA.
Alternatives: r=8 (smaller, faster), r=64 (more expressive), QLoRA (4-bit quantized for VRAM-constrained training)
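A back-of-envelope count shows why r=16 is cheap. A rank-r adapter adds r × (d_in + d_out) trainable parameters per targeted weight matrix; using Llama 3.1 8B's attention dimensions (hidden size 4096, GQA key/value dim 1024, 32 layers) and attention-only targets:

```python
# Back-of-envelope: trainable parameters for r=16 LoRA on Llama 3.1 8B,
# targeting the attention projections only (hidden=4096, GQA kv dim=1024,
# 32 layers). A rank-r adapter adds r*(d_in + d_out) params per matrix.
r, layers = 16, 32
shapes = {  # (d_in, d_out) per targeted weight matrix
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
per_layer = sum(r * (din + dout) for din, dout in shapes.values())
total = per_layer * layers
print(f"{total / 1e6:.1f}M trainable vs ~8,000M base "
      f"({100 * total / 8e9:.2f}% of base parameters)")
```

Roughly 13.6M trainable parameters against an 8B base, which is why adapter checkpoints are megabytes rather than gigabytes; adding the MLP projections as targets multiplies this by about 3x but stays well under 1% of the base.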
An A100 80GB trains 8B LoRA on 5K examples in ~45 minutes (~$1.50). Full fine-tuning of 8B needs 2x A100s and takes 3-4 hours ($8-12). For 70B LoRA, use 4x A100s (~$8/hr). Lambda Labs is 30-50% cheaper than AWS for spot-equivalent on-demand GPU.
Alternatives: RunPod H100 ($3.99/hr), AWS p4d.24xlarge ($32/hr, overkill for 8B), Google Colab Pro+ (A100, $50/mo)
Generic benchmarks (MMLU, HellaSwag) don't predict task-specific fine-tune quality. Build a 200-example golden test set with expected outputs. Use Claude Haiku 4 as judge ($0.80/M tokens) to score outputs on a 1-5 rubric. LLM-as-judge correlates 0.85+ with human ratings on generation tasks.
Alternatives: ROUGE-L, BERTScore, Braintrust, Promptfoo, HELM benchmarks
Together AI serves Llama 3.1 8B at $0.20/M tokens with no infra overhead — break-even vs self-hosting at ~2M tokens/day. Above that, vLLM on Modal with A10G instances ($0.10/M tokens at 60% GPU utilization) halves serving cost. vLLM enables PagedAttention — 2-4x higher throughput vs naive serving.
Alternatives: Fireworks AI, Replicate, Hugging Face Inference Endpoints, LoRAX for multi-adapter serving
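To show where a ~$0.10/M self-hosting figure can come from, here is the arithmetic under stated assumptions: the instance price ($1.10/hr for an A10G-class GPU) and peak throughput (~5,000 tok/s for an 8B model under vLLM) are assumptions, not quotes.

```python
# Self-hosting cost per 1M tokens, from assumed instance price and throughput.
GPU_HOURLY = 1.10            # assumed $/hr for an A10G-class instance
PEAK_TOKENS_PER_SEC = 5000   # assumed vLLM throughput for an 8B model
UTILIZATION = 0.60           # fraction of peak you actually sustain

effective_tokens_per_hour = PEAK_TOKENS_PER_SEC * UTILIZATION * 3600
cost_per_m = GPU_HOURLY / (effective_tokens_per_hour / 1e6)
print(f"${cost_per_m:.3f} per 1M tokens")
```

Utilization dominates the result: at 20% utilization the same GPU costs ~3x more per token, which is why low-traffic workloads usually belong on managed APIs.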
wandb captures loss curves, learning rate schedules, gradient norms, and eval metrics in real-time with minimal config (`wandb.init()` in training script). The training diff view lets you compare two runs in 30 seconds. Free tier is sufficient for most fine-tuning projects.
Alternatives: MLflow, Comet ML, TensorBoard
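The minimal wiring looks like this (a sketch; it assumes you have run `wandb login`, and the project name and config keys are illustrative). If you use the HF Trainer, setting `report_to="wandb"` in your training arguments does this automatically:

```python
# Minimal W&B wiring for a training script (sketch; assumes `wandb login`).
import wandb

run = wandb.init(
    project="llama31-lora",  # assumed project name
    config={"lora_r": 16, "learning_rate": 2e-4, "epochs": 3},
)

for step in range(1, 101):
    simulated_loss = 1.0 / step  # stand-in for your training loop's loss
    wandb.log({"train/loss": simulated_loss}, step=step)

run.finish()
```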
Cost at each scale
Prototype: 1 training run + 50K inference tokens/mo → $85/mo
Growth: 2 training runs/mo + 10M inference tokens/mo → $520/mo
Scale: weekly retraining + 500M inference tokens/mo → $6,800/mo
Failure modes & guardrails
Catastrophic forgetting: fine-tuning erases the model's general capabilities, leaving it narrow and brittle. Mitigation: use LoRA (it modifies only adapter weights), keep the learning rate low (1e-4 to 3e-4), and include 5-10% general instruction-following examples in the training mix alongside task-specific data.
Distribution shift: the fine-tuned model performs well on the test set but fails in production because production inputs differ from the training examples. Mitigation: collect 10-20% of training data from actual production traffic (anonymized), and re-evaluate on production samples weekly after deployment.
Overfitting: detectable when validation loss diverges from training loss after epoch 1-2. Mitigation: set early stopping with patience=3 on validation loss; for <1,000 examples, keep LoRA rank r≤8 and train for 3-5 epochs maximum; add dropout (0.05-0.1) to the LoRA adapters.
Test set contamination: the test set accidentally contains near-duplicates of training examples, inflating eval scores. Mitigation: run MinHash deduplication across the combined train+test set before splitting, and use an 80/10/10 train/val/test split with a separate held-out production evaluation set.
Frequently asked questions
When should I fine-tune instead of just improving my prompt?
Fine-tune when you've exhausted prompt engineering and still have consistent failure modes; when inference volume exceeds 5M tokens/mo (cost savings justify training cost); when you need sub-500ms latency that rules out GPT-4o; or when you need to prevent the model from seeing proprietary data in system prompts (fine-tuned behavior is baked in, not in the context window). Don't fine-tune for tasks with <100 examples — you'll overfit.
How much training data do I actually need?
For LoRA fine-tuning: 200-500 high-quality examples is enough to teach consistent formatting and domain vocabulary. 1,000-5,000 examples for reliable behavior change. 10,000+ for near-full-fine-tuning quality with LoRA. More important than count: diversity (cover your tail cases) and quality (no noise, correct labels). One bad example per 100 is acceptable; one bad per 10 will hurt.
What's the cheapest way to generate training data?
Synthetic data via Claude Sonnet 4 or GPT-4o is the most cost-effective: generate 10K examples for ~$10-30 using batch APIs. The risk is model collapse — training on AI-generated data degrades diversity over multiple generations. Hybrid approach works best: 20-30% human-curated seed examples, 70-80% synthetic generated from those seeds, then filtered by an LLM quality judge.
How do I know if my fine-tune is actually better than the base model?
Create a golden evaluation set of 200-500 examples BEFORE training (never touch it during development). Use three metrics: (1) task-specific metric (F1, ROUGE, exact match), (2) LLM-as-judge score (1-5 rubric, Claude Haiku is cost-effective), (3) A/B test on 5-10% of live traffic. Require improvement on all three before promoting to production. A fine-tune that improves metric #1 but fails #3 is not ready.
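The "improve on all three or don't ship" rule can be expressed directly as a promotion gate; the metric names and example scores below are illustrative:

```python
# Promotion gate for the three-metric rule: a candidate ships only if it
# beats the baseline on the task metric, the judge score, AND live traffic.
# Field names and scores are illustrative.

def promote(candidate: dict, baseline: dict) -> bool:
    return (
        candidate["task_metric"] > baseline["task_metric"]
        and candidate["judge_score"] > baseline["judge_score"]
        and candidate["ab_win_rate"] > 0.5  # wins the A/B test on live traffic
    )

base = {"task_metric": 0.81, "judge_score": 3.9, "ab_win_rate": 0.5}
cand = {"task_metric": 0.86, "judge_score": 4.2, "ab_win_rate": 0.47}
print(promote(cand, base))  # wins offline but loses the A/B test: blocked
```

Encoding the gate as code (and running it in CI) prevents the common failure where an offline win gets promoted despite losing on production traffic.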
Related
Architectures
Automated LLM Evaluation Harness: CI/CD for AI Quality
A production evaluation system for LLMs covering test dataset management, LLM-as-judge scoring, regression tes...
Text-to-SQL Agent
Reference architecture for translating natural-language questions into safe, correct SQL. Schema-aware prompti...
Intent Classification for Message Routing
Reference architecture for multi-label intent classification routing inbound customer messages to the right te...
Realtime Content Moderation Pipeline
Reference architecture for moderating user-generated text and images in realtime. Tiered policy classifier, hu...
Prompt Caching & Cost Optimization: 90% Savings on Repetitive Prompts
Architecture for Anthropic and OpenAI prompt caching: cache design patterns, minimum token thresholds, hit rat...