Self-Host vs LLM API: Should You Run Your Own Model?
Use a managed API (OpenAI, Anthropic, Google) unless you're processing over 1 billion tokens per month, have strict data residency requirements, or have dedicated ML infrastructure engineers. The break-even point for self-hosting is typically $15K–$25K/month in API spend.
FAQ
What is the break-even point for self-hosting vs API?
Typically $15,000–$25,000/month in API spend. That figure assumes two ML engineers at $150K/year fully dedicated to the project — staffing alone is $300K/year, or $25K/month, before a single GPU is provisioned. The break-even calculation must include GPU costs, engineering time, monitoring, incident response, and model upgrade cycles. Most teams underestimate the total cost of ownership by 2–3x.
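The cost components above can be sketched as a back-of-the-envelope calculator. The staffing figures come from the text; the GPU rate and ops overhead are illustrative assumptions, not vendor quotes:

```python
def monthly_self_host_cost(
    engineers: int = 2,
    salary_per_year: float = 150_000,  # per-engineer salary, from the text
    gpu_monthly: float = 8_000,        # assumed rate, e.g. reserved A100 capacity
    ops_overhead: float = 0.25,        # monitoring, incidents, upgrade cycles (assumed)
) -> float:
    """Rough monthly total cost of ownership for self-hosting."""
    staffing = engineers * salary_per_year / 12  # $25,000/month at the text's baseline
    base = staffing + gpu_monthly
    return base + ops_overhead * base


# Staffing alone, with no GPUs or overhead, already hits $25K/month:
print(monthly_self_host_cost(gpu_monthly=0, ops_overhead=0))  # 25000.0
```

If this estimate comes out below your current API bill, self-hosting may pencil out; the 2–3x underestimation pattern usually hides in the overhead term.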
What is Together AI / Replicate / Modal?
These are 'managed open-source' platforms — you get open models (Llama, Mistral, DeepSeek) served via API without managing your own infrastructure. Together AI runs Llama 3.3 70B at $0.88/M tokens, well below typical frontier-model API pricing. It's the best middle ground for teams that want open-model cost efficiency without GPU infrastructure.
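Per-token pricing makes spend easy to project. A minimal sketch using the $0.88/M figure quoted above (any other price plugged in here is a placeholder, not a current list price):

```python
def monthly_cost(tokens_per_month: int, price_per_million_tokens: float) -> float:
    """Projected monthly API spend at a flat per-million-token price."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


# 500M tokens/month on Llama 3.3 70B via Together AI at $0.88/M:
print(monthly_cost(500_000_000, 0.88))  # 440.0
```

Running the same volume through this formula at each provider's published rates is usually the fastest way to see whether managed open-source beats a closed-model API for your workload.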
Can Azure OpenAI or AWS Bedrock satisfy data residency requirements?
Often yes. Azure OpenAI Service processes data within your Azure region and supports HIPAA BAAs, GDPR, and most government compliance frameworks. AWS Bedrock similarly keeps data within your AWS account. These are the first options to explore before investing in self-hosted infrastructure.
What models can be realistically self-hosted?
In 2026, practical self-hosting options include: Llama 3.3 70B (requires 4x A100 80GB), Mistral Large 2 (requires 2x A100 80GB), DeepSeek V3 (open-weight, near-frontier), and Qwen 2.5 72B (strong multilingual). Anything requiring 8+ A100s (like Llama 3.1 405B) is rarely economical to self-host.
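The GPU counts above can be sanity-checked with a rough VRAM estimate. This sketch assumes dense FP16/BF16 weights plus a flat headroom factor for KV cache and activations — real requirements vary with quantization, batch size, context length, and serving stack (vLLM, TGI, etc.):

```python
import math


def min_gpus(
    params_billion: float,
    bytes_per_param: float = 2.0,   # FP16/BF16; use 1.0 for 8-bit quantization
    headroom: float = 0.25,         # assumed allowance for KV cache and activations
    gpu_vram_gb: float = 80.0,      # A100/H100 80GB class
) -> int:
    """Minimum GPU count to hold model weights plus serving headroom."""
    weights_gb = params_billion * bytes_per_param
    total_gb = weights_gb * (1 + headroom)
    return math.ceil(total_gb / gpu_vram_gb)


# Llama 3.3 70B in BF16: 140 GB of weights -> 3 GPUs by this floor estimate;
# 4x A100 80GB in practice leaves room for batching and longer contexts.
print(min_gpus(70))
```

The same formula shows why 405B-class models fall off the economical end: at BF16 they need on the order of a dozen 80GB GPUs before serving even starts.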