Reference Architecture · RAG
Log Analysis RAG
Last updated: April 16, 2026
Quick answer
Do not embed every log line — it is expensive and useless. Instead, cluster logs into Drain3 templates, embed the templates plus sample lines, and store them in ClickHouse (structured) and Qdrant (semantic) side by side. On query: detect anomalies in ClickHouse first, retrieve semantically similar patterns from Qdrant, rerank with Voyage Rerank 2.5, and synthesize with Claude Sonnet 4 (strong at causal reasoning). Expect $0.15 to $0.40 per investigation. The structured-first, vector-second pattern is non-negotiable at log scale.
The problem
Your SRE team ingests 1TB/day of logs, traces, and metrics across Datadog, ClickHouse, Loki, and S3. An on-call engineer asks ‘why did checkout p99 spike at 03:14 UTC’ and today that means 40 minutes of dashboard-hopping. You need natural-language queries over logs that return cited log lines, correlated traces, suspected root causes, and a confidence score — in seconds, not hours.
Architecture
Log Ingester
Vector.dev or Fluent Bit pipelines ship logs from Kubernetes, EC2, Lambda, and edge into a normalized schema, enriching each line with service, env, region, trace_id, and span_id.
Alternatives: Fluent Bit, Fluentd, OpenTelemetry Collector, Vector + Kafka
Log Template Extractor (Drain3)
Drain3 clusters log lines into templates (e.g., ‘GET /checkout [duration_ms] for user [id]’). Turns a million raw lines/minute into a few hundred unique templates — the unit of embedding.
Alternatives: Spell, LogMine, LLM-based templating (slow)
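The templating idea can be sketched in a few lines of Python. This is only a toy: real Drain3 builds a fixed-depth parse tree and clusters online, and the regex masks below (durations, user ids, bare numbers) are hypothetical stand-ins for its learned wildcards.

```python
import re
from collections import defaultdict

# Toy illustration of log templating: mask variable tokens so that lines
# differing only in parameters collapse to one template. Drain3 does this
# with a fixed-depth parse tree and online clustering; this regex version
# only conveys the idea.
MASKS = [
    (re.compile(r"\b\d+ms\b"), "[duration_ms]"),
    (re.compile(r"\buser=\w+\b"), "user=[id]"),
    (re.compile(r"\b\d+\b"), "[num]"),
]

def to_template(line: str) -> str:
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line

def cluster(lines):
    """Group raw lines under their masked template."""
    clusters = defaultdict(list)
    for line in lines:
        clusters[to_template(line)].append(line)
    return clusters
```

Two lines like `GET /checkout 120ms user=a41` and `GET /checkout 98ms user=b07` collapse to the single template `GET /checkout [duration_ms] user=[id]` — the unit you embed and join on.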
Structured Store (ClickHouse)
Stores every log line with template_id, timestamp, service, env, trace_id, and numeric fields. Handles ‘how often did template X happen in service Y between T1 and T2’ in milliseconds.
Alternatives: Grafana Loki, Datadog Logs, BigQuery, Elasticsearch
Template Embedder
Embeds each template’s canonical form + a sample line + its doc-comment (service-owner metadata). You embed thousands of templates, not billions of lines — the key to making this affordable.
Alternatives: Voyage-3, Cohere Embed v3, BGE-M3 self-hosted
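A minimal sketch of assembling the text that gets embedded for one template. The field names (`service`, `owner_team`, `doc`) are illustrative, not a fixed schema.

```python
def build_embed_text(template: str, sample_line: str, meta: dict) -> str:
    """Assemble the text embedded for one template: the canonical template,
    one concrete sample line, and service-owner metadata. Field names here
    are illustrative, not a fixed schema."""
    parts = [
        f"template: {template}",
        f"sample: {sample_line}",
        f"service: {meta.get('service', 'unknown')}",
        f"owner: {meta.get('owner_team', 'unknown')}",
    ]
    if meta.get("doc"):  # optional service-owner doc-comment
        parts.append(f"doc: {meta['doc']}")
    return "\n".join(parts)
```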
Vector DB (templates)
Qdrant stores template embeddings with metadata: service, severity, owner team, related trace pattern. Small — usually under 100k vectors total, even for massive orgs.
Alternatives: pgvector, Pinecone, Weaviate
Hybrid Retriever + Anomaly Join
For a natural-language query, runs semantic retrieval on templates, a structured ClickHouse query for time-window context, AND joins live anomaly signals (rolling z-score per template_id on rate/latency/error_share). Merges into a unified candidate set.
Alternatives: Semantic-only, Structured-only, Two-step: LLM→SQL→LLM
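One way the merge step could look, assuming cosine scores in [0, 1] from Qdrant and positive z-scores for anomalous templates from ClickHouse. The 0.6/0.4 weights are illustrative starting points, not tuned values.

```python
def merge_candidates(semantic_hits, anomaly_scores, w_sem=0.6, w_anom=0.4):
    """Merge semantic retrieval scores with live anomaly z-scores into one
    ranked candidate set.
    semantic_hits:  {template_id: cosine score in [0, 1]}
    anomaly_scores: {template_id: z-score, higher = more anomalous}
    """
    def squash(z):
        # Map positive z-scores into [0, 1) so the two signals are comparable.
        return z / (1.0 + abs(z)) if z > 0 else 0.0

    ids = set(semantic_hits) | set(anomaly_scores)
    return sorted(
        ids,
        key=lambda t: w_sem * semantic_hits.get(t, 0.0)
                      + w_anom * squash(anomaly_scores.get(t, 0.0)),
        reverse=True,
    )
```

A template that is only moderately similar to the query but is actively anomalous can outrank a closer semantic match — which is exactly the behavior an incident investigation wants.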
Reranker
Voyage Rerank 2.5 scores templates + sampled log bundles against the query. Cuts top 50 down to top 8 bundles the LLM will reason over.
Alternatives: Cohere Rerank v3, Jina Reranker v2
Root-cause Synthesizer
Claude Sonnet 4 — best causal reasoning over noisy evidence in 2026. Receives the query, top templates, anomaly signals, and a time-windowed trace bundle. Produces a hypothesis with evidence, confidence, and recommended next checks.
Alternatives: GPT-4o, Gemini 2.5 Pro, DeepSeek R1 (reasoning-heavy, cheaper)
Investigation UI
Slack bot + web UI. Renders the hypothesis, cited log lines (clickable to deep-link into Datadog/Loki), related traces, and ‘was this helpful’ feedback.
Alternatives: Slack workflow, Custom React, Datadog sidepanel
The stack
Vector.dev handles transforms, enrichment, and sampling in the pipeline (not at query time) — critical at 1TB/day. Kafka buffers between ingest and the structured store so spikes don’t cascade.
Alternatives: Fluent Bit, OpenTelemetry Collector
Drain3 is the de facto standard for log templating. Handles online updates as new templates appear, and gives you a stable template_id you can join on. LLM-based templating is accurate but 1000x slower — don’t use it in the hot path.
Alternatives: Spell, LogMine
ClickHouse crushes analytical queries over time-series log data and is cheap to self-host. Loki is simpler but slower for aggregations. Datadog is fine if you already pay — but structured-first RAG requires columnar speed for sub-second anomaly detection.
Alternatives: Grafana Loki, Datadog Logs, BigQuery
You embed templates (~10k-100k vectors total), not log lines. At that scale, the cheapest-per-token model is fine. Voyage-3 is ~3-5% better quality but the embedding budget is trivial here — pick whatever your platform already integrates with.
Alternatives: Voyage-3, BGE-M3
Small vector count (tens of thousands of templates), fast filters on service/severity. pgvector or even in-memory FAISS works. Managed Qdrant is the safe default.
Alternatives: pgvector, Pinecone
Per-template z-score on rate, latency, and error share is cheap, interpretable, and 80% as good as fancier models. Run every minute in ClickHouse. Graduate to Prophet or Isolation Forest only once you have proven the simple version covers you.
Alternatives: Prophet, Isolation Forest, Datadog Watchdog
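In production this runs as a scheduled ClickHouse query; the same per-template logic in plain Python, for reference:

```python
import math

def zscore(window, current):
    """Rolling z-score of the current per-minute value for one template_id
    against its trailing window. Rate, latency, and error share all use the
    same computation; flagging, e.g., |z| > 3 is a typical starting threshold."""
    n = len(window)
    mean = sum(window) / n
    var = sum((x - mean) ** 2 for x in window) / n
    std = math.sqrt(var)
    if std == 0:
        # Flat baseline: any deviation at all is maximally surprising.
        return 0.0 if current == mean else float("inf")
    return (current - mean) / std
```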
Claude Sonnet 4 does best on causal reasoning over noisy, incomplete evidence — exactly what root-cause is. DeepSeek R1 is a cheaper reasoning-tuned alternative if you are cost-sensitive and can tolerate higher latency.
Alternatives: GPT-4o, Gemini 2.5 Pro, DeepSeek R1
Cost at each scale
Prototype: 10 GB/day · 100 investigations/mo · $380/mo
Startup: 100 GB/day · 2k investigations/mo · $4,800/mo
Scale: 1 TB/day · 25k investigations/mo · $62,000/mo
Tradeoffs
Embed templates vs embed log lines
Embedding every log line at 1TB/day is both ludicrously expensive and useless — most lines are duplicates of a template. Embed the templates (tens of thousands) and let the structured store handle exact lookup. This one decision is what makes log RAG affordable.
Claude Sonnet 4 vs DeepSeek R1 for reasoning
Sonnet 4 is the default — best causal reasoning, clean citations, fast enough. DeepSeek R1 is a cheaper reasoning-specific alternative: ~5x lower cost per investigation but 2-3x higher latency. Use Sonnet for interactive Slack investigations, R1 for scheduled batch RCA digests.
Own the log store vs Datadog/Splunk
Self-hosting ClickHouse at 1TB/day costs ~$15-20k/mo all-in. The same volume on Datadog is $80-150k/mo. The self-host path pays back in under 2 months but demands real SRE capacity. Hybrid (hot in ClickHouse, archive in S3, dashboards still in Datadog) is usually the right 6-month compromise.
Failure modes & guardrails
New template explosion during deploys
Mitigation: Rate-limit new template creation per service. When Drain3 sees >50 new templates/minute from one service, pause template creation and alert — usually a log-format change or a bug printing unique stack traces. Auto-cluster the overflow post-hoc.
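A minimal sketch of that per-service rate limit, assuming minute-bucketed counters; the 50/minute threshold is the one suggested above, and everything else is illustrative.

```python
from collections import defaultdict

class TemplateRateLimiter:
    """Guardrail sketch: pause template creation for a service once it mints
    more than `limit` new templates in one minute. Overflow lines get
    clustered post-hoc instead of creating templates in the hot path."""

    def __init__(self, limit=50):
        self.limit = limit
        self.counts = defaultdict(int)  # (service, minute) -> new templates

    def allow_new_template(self, service: str, minute: int) -> bool:
        key = (service, minute)
        if self.counts[key] >= self.limit:
            return False  # paused: alert and defer to post-hoc clustering
        self.counts[key] += 1
        return True
```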
LLM hallucinates root causes from correlated but unrelated events
Mitigation: Force the model to cite specific log line timestamps and trace IDs. Post-generation, validate every cited trace_id actually exists in ClickHouse for that time window. Reject hypotheses with < 2 supporting evidence citations.
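A sketch of that post-generation check, with a plain set standing in for the ClickHouse trace lookup. Single-evidence hypotheses are demoted to 'candidate' rather than rejected outright.

```python
def validate_hypothesis(citations, existing_trace_ids, min_citations=2):
    """Post-generation guardrail: every cited trace_id must actually exist in
    the structured store for the investigated window, and a hypothesis needs
    at least `min_citations` surviving citations to be labeled a likely cause.
    `existing_trace_ids` stands in for a ClickHouse lookup."""
    valid = [c for c in citations if c["trace_id"] in existing_trace_ids]
    if len(valid) < min_citations:
        return "candidate", valid   # demoted, not presented as root cause
    return "likely_cause", valid
```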
Sensitive data (PII, secrets) leaks into the LLM
Mitigation: Scrub at the ingest layer with Vector.dev regex rules (email, phone, AWS keys, credit card). Tag services as ‘sensitive’ in metadata; for those services, route investigation to a self-hosted model (Llama 3.3 70B) instead of Claude/GPT.
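In the real pipeline these rules live in Vector.dev's transform layer; the Python below only sketches equivalent regexes. The patterns are illustrative rather than exhaustive (no Luhn check on card numbers, for instance).

```python
import re

# Ingest-time scrubbing sketch. Illustrative patterns only — a production
# ruleset needs far more coverage (GCP keys, bearer tokens, national IDs).
SCRUB_RULES = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "[AWS_KEY]"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD]"),
]

def scrub(line: str) -> str:
    for pattern, replacement in SCRUB_RULES:
        line = pattern.sub(replacement, line)
    return line
```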
Anomaly detector floods on seasonal traffic
Mitigation: Use rolling baselines that respect weekly seasonality (compare this Tuesday 03:14 to last 4 Tuesdays 03:14, not last 60 minutes). Couple with SLO burn rate — an anomaly that isn’t burning SLO is lower priority. Surface all of it but rank by impact, not novelty.
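Selecting the baseline is the whole trick. A sketch of the slot selection, assuming minute-resolution timestamps; 4 weeks matches the heuristic above.

```python
from datetime import datetime, timedelta

def seasonal_baseline_slots(ts: datetime, weeks: int = 4):
    """Return the timestamps to baseline against: the same minute on the
    same weekday for each of the previous `weeks` weeks, instead of the
    trailing 60 minutes. Weekday and time-of-day are preserved exactly."""
    return [ts - timedelta(weeks=w) for w in range(1, weeks + 1)]
```

The z-score is then computed against the values observed at these slots rather than the last hour, so a load pattern that repeats every Tuesday night stops looking anomalous.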
ClickHouse query cost explodes on open-ended questions
Mitigation: LLM-generated SQL is dangerous at 1TB/day. Use a prepared-query templating layer: the LLM picks one of ~20 parameterized ClickHouse query templates (ratio, p99, error_share, template_rate_window), never writes raw SQL. Cap query cost at the query budget gateway.
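A sketch of the prepared-query layer. The two SQL bodies are illustrative, but the `{name:Type}` placeholders use ClickHouse's real server-side query-parameter syntax, so values are bound by the server, never string-interpolated.

```python
# The LLM chooses a template name and fills whitelisted parameters;
# it never emits raw SQL. Table and column names here are illustrative.
QUERY_TEMPLATES = {
    "template_rate_window": (
        "SELECT toStartOfMinute(ts) AS minute, count() AS n "
        "FROM logs WHERE template_id = {template_id:UInt64} "
        "AND service = {service:String} "
        "AND ts BETWEEN {t1:DateTime} AND {t2:DateTime} "
        "GROUP BY minute ORDER BY minute"
    ),
    "error_share": (
        "SELECT countIf(level = 'error') / count() AS share "
        "FROM logs WHERE service = {service:String} "
        "AND ts BETWEEN {t1:DateTime} AND {t2:DateTime}"
    ),
}

ALLOWED_PARAMS = {"template_id", "service", "t1", "t2"}

def pick_query(name: str, params: dict):
    """Validate the LLM's choice: known template, whitelisted params only."""
    if name not in QUERY_TEMPLATES:
        raise ValueError(f"unknown query template: {name}")
    if not set(params) <= ALLOWED_PARAMS:
        raise ValueError(f"disallowed params: {set(params) - ALLOWED_PARAMS}")
    return QUERY_TEMPLATES[name], params  # executed with server-side binding
```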
Frequently asked questions
Should I embed every log line?
No, and this is the single most important decision. At 1TB/day you would spend more on embeddings than on the rest of the stack combined, and the signal-to-noise is terrible. Instead, cluster log lines into templates with Drain3, embed the templates (tens of thousands), and let a columnar store (ClickHouse) handle exact lookup and aggregation.
Which LLM is best for root-cause analysis in 2026?
Claude Sonnet 4 is the default — strongest causal reasoning over noisy evidence and clean citations. GPT-4o is close behind. DeepSeek R1 is a cheaper reasoning-tuned alternative, ~5x cheaper per investigation but 2-3x higher latency. Gemini 2.5 Pro’s 2M context helps when you need to reason across huge trace bundles.
ClickHouse or Loki or Datadog?
ClickHouse if you want columnar speed for anomaly detection and can staff it. Loki for simpler teams on the Grafana stack. Datadog if you are already on it and cost isn’t the primary driver. The architecture is the same — the LLM doesn’t care where logs live as long as you can run structured queries in <1s.
How does the system avoid hallucinating root causes?
Two guardrails. First, force the model to cite specific (service, timestamp, trace_id) tuples for every claim, and validate post-generation that those traces actually exist. Second, require at least two supporting evidence citations per hypothesis — single-evidence hypotheses become ‘candidate’, not ‘likely cause’.
How do I handle PII and secrets in logs?
Scrub at the ingest layer with Vector.dev regex rules (email, phone, AWS/GCP keys, CC patterns). Tag services that handle sensitive data; for those, route LLM calls to a self-hosted model (Llama 3.3 70B on vLLM) so raw content never leaves your network. Claude and GPT handle the rest.
Can the LLM write SQL against ClickHouse?
Be careful. Free-form LLM SQL at 1TB/day can burn thousands of dollars in a single bad query. Use a prepared-query layer: ~20 parameterized ClickHouse query templates (ratio, p99, error_share, template_rate_window) and let the LLM only pick parameters. Cap cost at the gateway.
How much does this cost at 1TB/day?
Self-hosted, budget $60-75k/mo all-in — ClickHouse storage dominates, then compute and LLM. Same volume on Datadog’s managed logs tier is $80-150k/mo just for ingestion. The self-host path pays back in weeks but demands real SRE capacity — don’t underestimate the ops load on ClickHouse.
How do I prevent anomaly-detector alert fatigue?
Use rolling baselines that respect weekly seasonality (this Tuesday 03:14 vs last 4 Tuesdays 03:14). Rank anomalies by SLO burn rate, not novelty. Suppress anomalies in the first 5 minutes of a deploy and for 10 minutes after. An alert that isn’t burning SLO is informational, not actionable.