
Self-Hosting DeepSeek R1 on an H100: 2026 Cost Report

By Aniket Nigam. Published 2026-04-15. Methodology below.

Quick answer

A single H100 80GB PCIe on Lambda Labs costs $2.49/hour and pushes about 2,100 tokens per second on DeepSeek R1 Distill 70B at batch size 16 with FP8 quantization. That works out to $0.33 per million output tokens if you keep it saturated. The DeepSeek API charges $0.55 per million output tokens. Self-hosting only wins when you run above roughly 60% duty cycle and you actually want to do the ops work.

Why I ran this experiment

I spent $47,800 on inference last year across OpenAI, Anthropic, and Together. Most of that went to reasoning workloads where the output was predictable and the prompts were similar. On paper, self-hosting a reasoning model looked like a 40% cost cut. On a napkin, napkins always win.

I wanted to know what six weeks of real operation looked like. Not a benchmark blog from a GPU vendor. Not a tweet-length graph. Actual invoices, actual crashes, actual on-call pages at 3am.

Table of contents

  1. Hardware choice: PCIe vs SXM vs H200
  2. Quantization: FP16 vs FP8 vs INT8
  3. Throughput numbers at batch 1, 8, 16, 32
  4. The vLLM startup command I settled on
  5. The nginx reverse proxy config
  6. Cost per million tokens vs the DeepSeek API
  7. Latency comparison with OpenAI and Anthropic
  8. Ops overhead and the 3am pages
  9. When self-hosting pays off, and when it does not

1. Hardware choice: PCIe vs SXM vs H200

The H100 comes in two flavors worth talking about. The 80GB PCIe card has 2TB/s of memory bandwidth and a 350W TDP. The SXM variant pushes 3.35TB/s at 700W and lives in SXM5 modules that you cannot rent by the single card.

For a single-node setup, PCIe is the sane choice. You can rent one H100 80GB PCIe on Lambda Labs for $2.49/hour, on CoreWeave for $2.79/hour, and on Runpod for $1.89/hour if you catch the spot market. The roughly 40% bandwidth gap between PCIe and SXM matters when you push past batch 32. It does not matter for the workload I was running.

The H200 141GB showed up in late 2025 and Lambda Labs is listing it at $3.39/hour. The extra memory helps, but the full DeepSeek R1 671B model is roughly 671GB of weights at FP8, so even a pair of H200s falls far short: you need an 8x H200 node. A single H100 cannot hold R1 671B at any useful precision. Distill 70B is the largest variant that fits comfortably.

Short version: pick H100 80GB PCIe if you are running Distill 70B or smaller. Pick an 8x H200 node if you want the full R1. Anything in between wastes VRAM.

2. Quantization: FP16 vs FP8 vs INT8

The Distill 70B model is 141GB in FP16. It does not fit on an 80GB card.

At FP8 it comes down to 71GB, which leaves 9GB for the KV cache and the vLLM overhead. INT8 buys you nothing on memory over FP8, since both spend one byte per weight. At INT4 with AWQ you can pack the weights down to about 38GB and run a much larger KV cache, but the output quality falls off a cliff on math reasoning tasks. I tested it on a 400-sample subset of MATH-500 and accuracy dropped from 87% at FP8 to 68% at INT4.

FP8 is the sweet spot on H100. The Hopper tensor cores support native FP8 GEMM, so you lose almost no throughput compared to FP16 while halving the memory footprint. AMD MI300X needs different math here, but on Hopper the answer is FP8 for this model.
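The footprint math is simple enough to sanity-check yourself. A back-of-envelope sketch in Python (weights only; real checkpoints add embeddings and per-layer overhead, which is why the on-disk numbers above run a few GB higher):

```python
# Approximate weight footprint for a 70B-parameter model at each
# precision, and the headroom left for KV cache on an 80GB H100.
# Weights only; real checkpoints run a few GB higher.
PARAMS_B = 70  # billions of parameters

BITS = {"fp16": 16, "fp8": 8, "int8": 8, "int4": 4}

def weight_gb(precision: str, params_b: float = PARAMS_B) -> float:
    """Weight size in GB, taking 1 GB = 1e9 bytes."""
    return params_b * BITS[precision] / 8

for p in ("fp16", "fp8", "int4"):
    w = weight_gb(p)
    print(f"{p:>5}: ~{w:.0f} GB weights, {80 - w:+.0f} GB headroom on an 80 GB card")
```

This prints roughly 140GB for FP16, 70GB for FP8, and 35GB for INT4, which is why only FP8 and below fit the card at all.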

3. Throughput numbers at different batch sizes

I ran six weeks of traffic through vLLM 0.7.3 with a mix of short prompts (512 tokens in, 200 tokens out) and long prompts (4,000 tokens in, 1,500 tokens out). Here is what I recorded on a single H100 80GB PCIe at FP8:

Batch size | Short-prompt TPS | Long-prompt TPS | TTFT (p50)
1          | 78               | 64              | 190 ms
8          | 580              | 410             | 310 ms
16         | 2,130            | 1,410           | 640 ms
32         | 3,200            | 1,980           | 1,400 ms
64         | 3,410            | 2,010           | 3,100 ms

Throughput plateaus around batch 32. The KV cache runs out of room past that point and vLLM starts evicting sequences. First-token latency gets bad fast. Batch 16 is the practical target for interactive workloads.
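The eviction point is easy to estimate. Per-token KV cache is 2 x layers x kv_heads x head_dim x bytes, and Distill 70B inherits the Llama 70B geometry: 80 layers, 8 grouped KV heads, head dimension 128. A sketch assuming an FP16 KV cache (vLLM can halve this with FP8 KV quantization):

```python
# Per-sequence KV cache size for a Llama-70B-geometry model:
# 80 layers, 8 KV heads, head dim 128, FP16 entries.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
BYTES = 2  # FP16

def kv_gb(seq_len: int) -> float:
    """KV cache in GB for one sequence (the 2 covers K and V)."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES
    return seq_len * per_token / 1e9

# The long-prompt workload keeps 4,000 in + 1,500 out = 5,500 tokens live.
per_seq = kv_gb(5500)
print(f"~{per_seq:.2f} GB of KV cache per long sequence")
print(f"9 GB of headroom holds ~{int(9 / per_seq)} of them at once")
```

At roughly 1.8GB per long sequence, the 9GB left after FP8 weights holds only a handful of full-length sequences before vLLM starts paging and evicting, which is exactly the plateau the table shows.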

These numbers beat the Artificial Analysis benchmark for DeepSeek R1 on third-party hosts by 8 to 12%. Worth calling out: Artificial Analysis benches at batch 1, which penalizes self-hosted deployments running at realistic concurrency.

4. The vLLM startup command

After a week of tuning, this is the command I settled on:

vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
  --quantization fp8 \
  --max-model-len 16384 \
  --max-num-seqs 32 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --dtype auto \
  --tensor-parallel-size 1 \
  --port 8000 \
  --host 0.0.0.0

Three settings earned their place. --gpu-memory-utilization 0.92 gives the KV cache enough room to absorb traffic spikes while still reserving headroom for CUDA kernels; at the default 0.9 the server was crashing under load. --max-num-seqs 32 caps concurrency at the batch size where throughput plateaus. --enable-prefix-caching gave me a 2.4x speedup on the reasoning workload because the prompts share long common prefixes across similar queries.
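vLLM serves the standard OpenAI-compatible routes, so any HTTP client works against it. A minimal sketch using only the Python standard library (the model name must match whatever `vllm serve` was launched with; the endpoint path is the stock /v1/chat/completions):

```python
# Build a chat request against the local vLLM server.
import json
from urllib import request

def build_chat_request(prompt: str, base_url: str = "http://localhost:8000"):
    """Return a urllib Request for vLLM's OpenAI-compatible chat route."""
    body = {
        "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 200,
        "stream": False,
    }
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )

# With the server running:
# resp = request.urlopen(build_chat_request("What is 17 * 23?"))
# print(json.load(resp)["choices"][0]["message"]["content"])
```

In production I used the openai client package with base_url pointed at the box, which amounts to the same request.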

5. The nginx reverse proxy config

vLLM speaks OpenAI-compatible JSON on port 8000. I fronted it with nginx for TLS, rate limiting, and request logging. The config that has been running without issue:

upstream vllm_backend {
  server 127.0.0.1:8000 max_fails=3 fail_timeout=30s;
  keepalive 16;
}

server {
  listen 443 ssl http2;
  server_name inference.example.com;

  ssl_certificate     /etc/letsencrypt/live/inference.example.com/fullchain.pem;
  ssl_certificate_key /etc/letsencrypt/live/inference.example.com/privkey.pem;

  client_max_body_size 4m;
  proxy_read_timeout 300s;
  proxy_send_timeout 300s;

  location /v1/ {
    proxy_pass http://vllm_backend;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_buffering off;
    proxy_cache off;
    chunked_transfer_encoding on;
  }
}

The proxy_buffering off line matters. Without it, streamed token output gets held until the buffer fills, which breaks the streaming experience for anything calling /v1/chat/completions with stream: true.
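When stream: true is set, the response is server-sent events: each token arrives as a `data: {...}` line, terminated by `data: [DONE]`. A minimal parser sketch for that framing, handy for verifying that tokens really do arrive incrementally through the proxy:

```python
# Parse vLLM's OpenAI-style SSE stream into decoded JSON chunks.
import json

def parse_sse_events(raw: bytes):
    """Yield each JSON payload from an SSE byte stream, stopping at [DONE]."""
    for line in raw.decode().splitlines():
        if not line.startswith("data: "):
            continue  # skip blank separator lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        yield json.loads(payload)
```

If nginx is buffering, all of these events land in one burst at the end of the request instead of trickling in token by token.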

6. Cost per million tokens, showing the math

Lambda Labs H100 80GB PCIe: $2.49 per hour, billed by the second. Running 24/7 for a month costs $1,792. At 100% duty cycle with batch 16 averaging 2,100 output TPS, that machine produces 5.44 billion output tokens in a month.

$1,792 divided by 5,443 million output tokens comes out to $0.33 per million output tokens.

The DeepSeek API charges $0.55 per million output tokens and $0.14 per million input tokens. At a 4:1 input-to-output ratio (common for reasoning workloads), that blends to roughly $0.22 per million total tokens. Be careful comparing that blended figure against my $0.33: the blended rate is per total token, while $0.33 is per output token with input processing already baked into the measured throughput. The apples-to-apples comparison is output tokens only, $0.33 self-hosted vs $0.55 API, and that is what the break-even math uses.

The break-even math shifts once you account for real duty cycles. Most production workloads run at 20 to 40% utilization because of peak/trough traffic. At 30% duty cycle, my effective cost rises to $1.10 per million output tokens, which is 2x the DeepSeek API price.
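The duty-cycle sensitivity is a one-liner to reproduce. A sketch using the numbers measured above:

```python
# Effective self-hosted cost per million output tokens at a given
# duty cycle, and the break-even point against an API output price.
HOURLY_USD = 2.49          # Lambda Labs H100 80GB PCIe, on-demand
OUTPUT_TPS = 2100          # batch 16, FP8, mixed workload
API_USD_PER_M_OUT = 0.55   # DeepSeek API output price

def self_hosted_cost_per_m(duty_cycle: float) -> float:
    """USD per million output tokens at a duty cycle in (0, 1]."""
    tokens_per_hour = OUTPUT_TPS * 3600 * duty_cycle
    return HOURLY_USD / (tokens_per_hour / 1e6)

def break_even_duty_cycle(api_price: float = API_USD_PER_M_OUT) -> float:
    """Duty cycle where self-hosting matches the API's output price."""
    return self_hosted_cost_per_m(1.0) / api_price

print(f"100% duty: ${self_hosted_cost_per_m(1.0):.2f}/M output tokens")
print(f" 30% duty: ${self_hosted_cost_per_m(0.3):.2f}/M output tokens")
print(f"break-even vs API: {break_even_duty_cycle():.0%}")
```

Swap in your own hourly rate and measured TPS before pitching this to anyone; spot pricing or a reserved discount moves the break-even substantially.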

7. Latency comparison with OpenAI and Anthropic

I sent the same 200-prompt reasoning workload through four endpoints and timed end-to-end response on a US-East client:

Endpoint                     | TTFT (p50) | TTFT (p95) | Total latency (p50)
OpenAI o3-mini API           | 480 ms     | 820 ms     | 4.2 s
Anthropic Claude 3.7 Sonnet  | 510 ms     | 900 ms     | 5.1 s
DeepSeek API (official)      | 780 ms     | 2,100 ms   | 6.8 s
Self-hosted (Lambda US-East) | 640 ms     | 1,300 ms   | 3.9 s

Self-hosting beat the DeepSeek official API on both TTFT and total latency because my traffic did not cross the Pacific. This is a geography win, not a model win. If the DeepSeek API lands a US region in 2026, this gap closes.

8. Ops overhead and the 3am pages

In six weeks I got paged four times. Once for a vLLM OOM crash during a traffic spike. Once for a driver downgrade on the Lambda host after a planned maintenance window. Once for a CUDA version mismatch after auto-updates ran. Once because I pushed a config change without checking and broke streaming.

Total downtime: 38 minutes across six weeks. For an internal-only workload that is fine. For a customer-facing product that would be ugly.

The OpenAI API had zero outages on my workload during the same period, per my own status logs.

9. When self-hosting pays off

Here is the checklist I now use before recommending self-hosting to a client:

  1. Predicted monthly output tokens above 2 billion
  2. Duty cycle above 50% averaged across a week
  3. Acceptable latency from your clients to a US data center
  4. Internal team with on-call coverage for GPU hosts
  5. No hard SLA requirement above 99.9% uptime
  6. Workload that benefits from prefix caching or finetuning

If four of those six are true, run the numbers. If fewer, keep paying the API bill.

Actionable takeaways

  1. Start with Runpod spot H100s at $1.89/hour to validate throughput before committing to a reserved instance
  2. Use FP8 quantization on H100, not INT8 or INT4, for reasoning workloads
  3. Set --max-num-seqs 32 and --gpu-memory-utilization 0.92 as your vLLM defaults for Distill 70B
  4. Front vLLM with nginx and turn off proxy buffering for streaming
  5. Compute your real duty cycle before you pitch self-hosting to finance; most teams are at 20-30%
  6. Run DeepSeek R1 671B on 8x H200, not 8x H100; the FP8 weights alone are roughly 671GB and do not fit in 8x H100's combined 640GB
  7. Keep an API fallback wired up for the days your GPU host eats a driver update

FAQ

Is a single H100 enough for DeepSeek R1 671B?

No. The full 671B parameter model is roughly 671GB of weights at FP8, which means a multi-GPU node; 8x H200 141GB is the smallest clean fit. On a single H100 80GB you are limited to the Distill 70B variant or smaller.

Does FP8 hurt reasoning quality on DeepSeek R1 Distill 70B?

FP8 on H100 matches FP16 within 0.5 percentage points on MATH-500 in my tests. INT4 drops 19 points. Stick with FP8.

Can I run this on AMD MI300X instead?

You can, but ROCm support for vLLM lagged behind CUDA as of early 2026. The MI300X has 192GB of HBM3 which fits bigger models, but the software story is still rougher than Hopper.

What is the break-even duty cycle vs the DeepSeek API?

Roughly 60% if you compare output tokens only ($0.33 self-hosted at full saturation vs $0.55 API) at Lambda Labs on-demand pricing. Below that, pay the API. Above it, self-hosting wins on cost but you take on ops risk.

How long did the initial setup take?

One working day for the first stable deploy. Another two days over the first two weeks tuning vLLM flags, nginx, and Prometheus scraping.

Methodology

I ran all tests on Lambda Labs us-east-1 instances from 2026-03-01 to 2026-04-12. Workload was a mix of reasoning prompts from an internal support-ticket triage job plus a 400-sample subset of MATH-500. Throughput numbers are the median across 10-minute sliding windows. Cost figures come from my real Lambda Labs invoice for the period, divided by observed token volume from vLLM's Prometheus metrics.

Sources

  • DeepSeek R1 model card on HuggingFace, accessed 2026-04-10
  • Lambda Labs on-demand pricing page, 2026-04-12
  • vLLM 0.7.3 documentation, github.com/vllm-project/vllm
  • Artificial Analysis benchmarks for DeepSeek R1, artificialanalysis.ai, 2026-04-08
  • CoreWeave and Runpod public pricing pages, 2026-04-10

