Langfuse vs Braintrust vs Helicone (2026): LLM Observability Tools Compared
You're flying blind without LLM observability. You don't know which prompts are failing, which model calls are expensive, or why your agent got stuck in a loop last Tuesday. Three tools have emerged as the leading options for LLM observability in production.
Quick Comparison
| Tool | Best For | Free Tier | Self-Host | Open Source |
| --- | --- | --- | --- | --- |
| Langfuse | Tracing complex agents, open-source teams | 50K obs/month | ✓ | ✓ |
| Braintrust | Evals-first teams, prompt experiments | 1K rows/month | Limited | Partial |
| Helicone | Simple observability, proxy-based logging | 10K req/month | ✓ | ✓ |
Langfuse
What It Is
Langfuse is open-source LLM observability with a heavy focus on tracing — understanding the full execution flow of complex agentic systems. When your agent makes 15 API calls to complete a task, Langfuse shows you all 15 calls in a hierarchical trace, with latency, cost, and input/output at each step.
Pricing
| Plan | Price | Observations |
| --- | --- | --- |
| Hobby | Free | 50K/month |
| Pro | $59/month | 500K/month |
| Team | $299/month | 5M/month |
| Enterprise | Custom | Unlimited |
| Self-hosted | Free | Unlimited |
Core Feature: Tracing
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context
langfuse = Langfuse(
public_key="pk-lf-your-key",
secret_key="sk-lf-your-key"
)
@observe() # Automatically traces this function
def research_agent(query: str) -> str:
# First step: search
search_results = web_search(query)
langfuse_context.update_current_observation(
input={"query": query},
metadata={"step": "search"}
)
# Second step: analyze
analysis = analyze_results(search_results)
return analysis
@observe(name="web-search")
def web_search(query: str) -> list:
    # ... implementation
    return []
@observe(name="analyze")
def analyze_results(results: list) -> str:
# LLM call here is automatically traced
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
messages=[{"role": "user", "content": f"Analyze: {results}"}]
)
return response.content[0].text
# All calls are traced hierarchically in Langfuse
result = research_agent("What are the latest LLM benchmarks?")
OpenAI Integration
from langfuse.openai import OpenAI # Drop-in replacement
client = OpenAI() # All calls automatically traced
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}],
# Langfuse metadata
name="greeting-handler",
user_id="user-123",
tags=["production", "chat"]
)
Evals in Langfuse
# Score individual traces
langfuse.score(
trace_id="trace-abc123",
name="quality",
value=0.9,
comment="Response was accurate and well-structured"
)
# Run automated evaluations
from langfuse import Langfuse
langfuse = Langfuse()
# Fetch recent traces for eval
traces = langfuse.fetch_traces(limit=100).data
for trace in traces:
# Run your eval function
score = evaluate_response(trace.output, trace.input)
langfuse.score(
trace_id=trace.id,
name="auto-eval",
value=score
)
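The loop above calls an `evaluate_response` you'd supply yourself. As a placeholder, here is a minimal heuristic sketch (hypothetical — in practice you'd use an LLM judge or task-specific checks):

```python
def evaluate_response(output: str, input_text: str) -> float:
    """Hypothetical heuristic scorer returning a value in [0, 1].

    A real eval would use an LLM judge or task-specific checks;
    this sketch just penalizes empty or very short responses.
    """
    if not output:
        return 0.0
    score = 0.5
    if len(output) > 100:  # rewards substantive answers
        score += 0.3
    words = input_text.split() if input_text else []
    if words and words[0].lower() in output.lower():
        score += 0.2  # response mentions the topic of the query
    return min(score, 1.0)
```

Anything that maps a trace's input/output to a number works here — the point is that Langfuse stores the score alongside the trace, so you can filter and chart it in the dashboard.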
Self-Hosting
# docker-compose.yml (the top-level "version" key is obsolete in Compose v2)
services:
langfuse-server:
image: langfuse/langfuse:latest
environment:
DATABASE_URL: postgresql://postgres:password@db:5432/langfuse
NEXTAUTH_SECRET: your-secret
SALT: your-salt
ports:
- "3000:3000"
depends_on:
- db
db:
image: postgres:15
environment:
POSTGRES_PASSWORD: password
POSTGRES_DB: langfuse
    volumes:
      - postgres_data:/var/lib/postgresql/data
volumes:
  postgres_data:
Langfuse Strengths
- Top-tier tracing for complex agents
- Fully open-source (MIT license)
- Self-hostable with Docker
- Strong LangChain/LangGraph integration
- Dataset management for evals
- Prompt versioning built in
Langfuse Weaknesses
- Evals tooling less mature than Braintrust
- UI can be overwhelming for simple use cases
- Self-hosting requires infrastructure investment
Braintrust
What It Is
Braintrust takes an evals-first approach. Where Langfuse starts with tracing and adds evals, Braintrust starts with experiments ("how does prompt A compare to prompt B?") and adds logging.
Pricing
| Plan | Price | Rows/month |
| --- | --- | --- |
| Free | $0 | 1,000 |
| Team | $200/month | 100,000 |
| Enterprise | Custom | Unlimited |
Core Feature: Experiments
from braintrust import Eval
from autoevals import LLMClassifier, Factuality, ClosedQA  # scorers ship in the autoevals package

# Define your eval
async def run_experiment():
    await Eval(
        "summarization-quality",
        data=lambda: [
            {
                "input": {"text": "Long article about quantum computing..."},
                "expected": "Brief factual summary"
            },
            # more test cases...
        ],
        task=lambda input: summarize(input["text"]),  # Your function
        scores=[
            LLMClassifier(
                name="quality",
                prompt_template="Rate this summary as excellent, good, or poor: {{output}}",
                choice_scores={"excellent": 1, "good": 0.7, "poor": 0.3}
            ),
            Factuality(),  # Built-in factuality scorer
            ClosedQA()  # Built-in QA scorer
        ]
    )
# Run: braintrust eval my_evals.py
# See results in the Braintrust dashboard
Logging with Braintrust
from braintrust import init_logger, traced
logger = init_logger(project="my-app")
@traced # Automatically logged
def process_document(doc: str) -> str:
# LLM call
result = llm_call(doc)
return result
# Manual logging
with logger.start_span("my-operation") as span:
result = expensive_operation()
span.log(
input={"data": "..."},
output=result,
metadata={"model": "claude-sonnet-4-5", "cost": 0.003}
)
Prompt Versioning in Braintrust
from braintrust import load_prompt
# Prompts stored and versioned in Braintrust
prompt = load_prompt(project="my-app", slug="summarization-prompt")
response = openai_client.chat.completions.create(
**prompt.build(text="Article to summarize...") # Injects current prompt version
)
Braintrust Strengths
- Best eval workflow of any tool
- A/B testing prompts is the core use case
- Built-in LLM judges for factuality, QA, summarization
- Clean, focused UI
- Good TypeScript support
Braintrust Weaknesses
- Expensive for high-volume logging (logging is secondary to evals)
- Free tier very limited (1K rows)
- No self-hosting on free/team plans
- Less mature agent tracing than Langfuse
Helicone
What It Is
Helicone takes the simplest approach: it's a proxy that sits between your app and LLM providers, logging every request transparently. No SDK changes required — just change your base URL.
Pricing
| Plan | Price | Requests |
| --- | --- | --- |
| Free | $0 | 10,000/month |
| Pro | $20/month | 100K/month |
| Growth | $100/month | 1M/month |
| Enterprise | Custom | Unlimited |
Setup: Zero Code Changes
# Before
client = openai.OpenAI(api_key="your-openai-key")
# After — only change the base URL
client = openai.OpenAI(
api_key="your-openai-key",
base_url="https://oai.helicone.ai/v1",
default_headers={
"Helicone-Auth": "Bearer pk-helicone-your-key",
"Helicone-User-Id": "user-123", # Optional: track per user
"Helicone-Property-feature": "chat" # Optional: custom properties
}
)
# All existing code works unchanged
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}]
)
# Claude via Helicone
client = anthropic.Anthropic(
api_key="your-anthropic-key",
base_url="https://anthropic.helicone.ai",
default_headers={
"Helicone-Auth": "Bearer pk-helicone-your-key"
}
)
Caching with Helicone
# Enable caching — identical requests return cached responses
client = openai.OpenAI(
api_key="your-key",
base_url="https://oai.helicone.ai/v1",
default_headers={
"Helicone-Auth": "Bearer pk-helicone-your-key",
"Helicone-Cache-Enabled": "true",
"Helicone-Cache-Bucket-Max-Size": "10",
"Cache-Control": "max-age=3600" # 1 hour cache
}
)
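Conceptually, the proxy can only serve a cached response when a new request hashes to the same key as a previous one. A stdlib sketch of that idea (illustrative only — Helicone's actual key derivation is internal to the proxy):

```python
import hashlib
import json

def cache_key(model: str, messages: list, bucket_seed: str = "") -> str:
    """Illustrative cache key: hash of the canonicalized request body.
    Identical model + messages (+ optional seed) produce the same key,
    which is why only byte-identical requests hit the cache."""
    payload = json.dumps(
        {"model": model, "messages": messages, "seed": bucket_seed},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode()).hexdigest()

k1 = cache_key("gpt-4o", [{"role": "user", "content": "Hello"}])
k2 = cache_key("gpt-4o", [{"role": "user", "content": "Hello"}])
k3 = cache_key("gpt-4o", [{"role": "user", "content": "Hi"}])
```

The practical consequence: any variation in the prompt — even whitespace — is a cache miss, so caching pays off most for static system prompts and repeated user queries.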
Rate Limiting Per User
# Limit users to 100 requests/day
default_headers={
    "Helicone-Auth": "Bearer sk-helicone-your-key",
    "Helicone-User-Id": current_user_id,
    "Helicone-RateLimit-Policy": "100;w=86400;s=user"  # 100 requests per 86,400-second window, segmented per user
}
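Helicone enforces the policy at the proxy, rejecting requests over the quota. If you also want a client-side guard before requests ever leave your app, a minimal fixed-window counter sketch (a generic pattern, not part of Helicone's SDK):

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Hypothetical client-side guard mirroring a
    'limit requests per window per user' policy."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)  # (user_id, window_index) -> count

    def allow(self, user_id: str, now=None) -> bool:
        now = time.time() if now is None else now
        key = (user_id, int(now // self.window))
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True

# 100 requests per day per user, matching the proxy-side policy above
limiter = FixedWindowLimiter(limit=100, window_seconds=86400)
```

Failing fast locally avoids burning a network round-trip on requests the proxy would reject anyway.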
Helicone Strengths
- Zero code changes to get started
- Works with any OpenAI-compatible provider
- Request caching out of the box
- Per-user rate limiting
- Clean, simple dashboard
- Cheapest paid plan ($20/month)
Helicone Weaknesses
- No agent tracing (it's request-level, not trace-level)
- Minimal eval capabilities
- Less suitable for complex multi-step workflows
- Adds a network hop (though Helicone claims <10ms overhead)
Feature Deep Dive
Agent Tracing
For complex agents with multiple steps:
- Langfuse: Best. Built specifically for hierarchical agent traces.
- Braintrust: Good. Span-based tracing works well.
- Helicone: Not suitable. It logs individual requests, not traces.
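The distinction is easiest to see in the data shapes themselves. An illustrative sketch (hypothetical field names) of a hierarchical trace versus a flat request log:

```python
# Trace-level (Langfuse/Braintrust style): one tree per task,
# child spans nested under the step that triggered them
trace = {
    "name": "research_agent",
    "children": [
        {"name": "web-search", "latency_ms": 420, "children": []},
        {"name": "analyze", "latency_ms": 1850, "children": [
            {"name": "llm-call", "latency_ms": 1790, "children": []},
        ]},
    ],
}

# Request-level (Helicone style): independent log lines, no parent links
requests = [
    {"endpoint": "/v1/chat/completions", "latency_ms": 1790},
    {"endpoint": "/v1/chat/completions", "latency_ms": 600},
]

def span_count(node) -> int:
    """Count all spans in a trace tree, root included."""
    return 1 + sum(span_count(child) for child in node["children"])
```

With the tree you can ask "which step of the agent was slow?"; with the flat log you can only ask "which request was slow?" — that is the whole trade-off.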
Evaluations
- Braintrust: Best. Experiments, A/B testing, and scoring are the core product.
- Langfuse: Good. Dataset management + scoring, but the experiment workflow is less polished.
- Helicone: Minimal. Basic feedback collection only.
Cost Analytics
All three show you cost per request and aggregate cost over time. Langfuse and Helicone break it down by user and feature; Braintrust focuses on cost per experiment.
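Whichever tool you pick, the underlying arithmetic is a simple rollup over request records. A sketch of a per-user cost breakdown, using a hypothetical record format:

```python
from collections import defaultdict

def cost_by_user(records) -> dict:
    """Aggregate total spend per user from request-level records.
    Each record is assumed to carry 'user_id' and 'cost_usd' fields."""
    totals = defaultdict(float)
    for record in records:
        totals[record["user_id"]] += record["cost_usd"]
    return dict(totals)

records = [
    {"user_id": "user-123", "cost_usd": 0.003},
    {"user_id": "user-123", "cost_usd": 0.012},
    {"user_id": "user-456", "cost_usd": 0.004},
]
```

The same grouping logic applies to any dimension — swap `user_id` for a feature tag or model name and you have the breakdowns these dashboards show.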
Self-Hosting
- Langfuse: Full self-host with Docker Compose, Helm chart for Kubernetes. Actively maintained.
- Helicone: Self-host available on paid plans.
- Braintrust: Limited self-host options.
My Recommendation
For most teams starting out: Helicone. Zero setup, immediate visibility into costs and requests. Upgrade to something more sophisticated when you need it.
For teams with complex agents: Langfuse. The tracing capabilities are unmatched, the open-source license is reassuring, and the self-host option is production-grade.
For teams doing serious prompt engineering: Braintrust. The experiment and eval workflow is the best way to systematically improve your prompts.
For production teams: Langfuse (self-hosted) + Helicone's caching layer is a strong combination — you get deep tracing plus transparent caching without paying per-observation fees.