Reference Architecture · RAG
Log Analysis RAG
Last updated: April 16, 2026
Quick answer
Do not embed every log line — it is expensive and useless. Instead, cluster logs into Drain3 templates, embed the templates plus sample lines, and store them in ClickHouse (structured) and Qdrant (semantic) side by side. On query: detect anomalies in ClickHouse first, retrieve semantically similar patterns from Qdrant, rerank with Voyage Rerank 2.5, and synthesize with Claude Sonnet 4 (strong at causal reasoning). Expect $0.15 to $0.40 per investigation. The structured-first, vector-second pattern is non-negotiable at log scale.
The problem
Your SRE team ingests 1TB/day of logs, traces, and metrics across Datadog, ClickHouse, Loki, and S3. An on-call engineer asks ‘why did checkout p99 spike at 03:14 UTC’ and today that means 40 minutes of dashboard-hopping. You need natural-language queries over logs that return cited log lines, correlated traces, suspected root causes, and a confidence score — in seconds, not hours.
Architecture
Log Ingester
Vector.dev or Fluent Bit pipelines ship logs from Kubernetes, EC2, Lambda, and edge into a normalized schema, enriching each line with service, env, region, trace_id, and span_id.
Alternatives: Fluent Bit, Fluentd, OpenTelemetry Collector, Vector + Kafka
Log Template Extractor (Drain3)
Drain3 clusters log lines into templates (e.g., ‘GET /checkout [duration_ms] for user [id]’). Turns a million raw lines/minute into a few hundred unique templates — the unit of embedding.
Alternatives: Spell, LogMine, LLM-based templating (slow)
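The templating idea can be sketched in a few lines of Python. This is only a toy: real Drain3 builds a fixed-depth parse tree and clusters online, and the regex masks below (durations, user ids, bare numbers) are hypothetical stand-ins for its learned wildcards.

```python
import re
from collections import defaultdict

# Toy illustration of log templating: mask variable tokens so that lines
# differing only in parameters collapse to one template. Drain3 does this
# with a fixed-depth parse tree and online clustering; this regex version
# only conveys the idea.
MASKS = [
    (re.compile(r"\b\d+ms\b"), "[duration_ms]"),
    (re.compile(r"\buser=\w+\b"), "user=[id]"),
    (re.compile(r"\b\d+\b"), "[num]"),
]

def to_template(line: str) -> str:
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line

def cluster(lines):
    """Group raw lines under their masked template."""
    clusters = defaultdict(list)
    for line in lines:
        clusters[to_template(line)].append(line)
    return clusters
```

Two lines like `GET /checkout 120ms user=a41` and `GET /checkout 98ms user=b07` collapse to the single template `GET /checkout [duration_ms] user=[id]` — the unit you embed and join on.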
Structured Store (ClickHouse)
Stores every log line with template_id, timestamp, service, env, trace_id, and numeric fields. Handles ‘how often did template X happen in service Y between T1 and T2’ in milliseconds.
Alternatives: Grafana Loki, Datadog Logs, BigQuery, Elasticsearch
Template Embedder
Embeds each template’s canonical form + a sample line + its doc-comment (service-owner metadata). You embed thousands of templates, not billions of lines — the key to making this affordable.
Alternatives: Voyage-3, Cohere Embed v3, BGE-M3 self-hosted
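A minimal sketch of assembling the text that gets embedded for one template. The field names (`service`, `owner_team`, `doc`) are illustrative, not a fixed schema.

```python
def build_embed_text(template: str, sample_line: str, meta: dict) -> str:
    """Assemble the text embedded for one template: the canonical template,
    one concrete sample line, and service-owner metadata. Field names here
    are illustrative, not a fixed schema."""
    parts = [
        f"template: {template}",
        f"sample: {sample_line}",
        f"service: {meta.get('service', 'unknown')}",
        f"owner: {meta.get('owner_team', 'unknown')}",
    ]
    if meta.get("doc"):  # optional service-owner doc-comment
        parts.append(f"doc: {meta['doc']}")
    return "\n".join(parts)
```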
Vector DB (templates)
Qdrant stores template embeddings with metadata: service, severity, owner team, related trace pattern. Small — usually under 100k vectors total, even for massive orgs.
Alternatives: pgvector, Pinecone, Weaviate
Hybrid Retriever + Anomaly Join
For a natural-language query, runs semantic retrieval on templates, a structured ClickHouse query for time-window context, AND joins live anomaly signals (rolling z-score per template_id on rate/latency/error_share). Merges into a unified candidate set.
Alternatives: Semantic-only, Structured-only, Two-step: LLM→SQL→LLM
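One way the merge step could look, assuming cosine scores in [0, 1] from Qdrant and positive z-scores for anomalous templates from ClickHouse. The 0.6/0.4 weights are illustrative starting points, not tuned values.

```python
def merge_candidates(semantic_hits, anomaly_scores, w_sem=0.6, w_anom=0.4):
    """Merge semantic retrieval scores with live anomaly z-scores into one
    ranked candidate set.
    semantic_hits:  {template_id: cosine score in [0, 1]}
    anomaly_scores: {template_id: z-score, higher = more anomalous}
    """
    def squash(z):
        # Map positive z-scores into [0, 1) so the two signals are comparable.
        return z / (1.0 + abs(z)) if z > 0 else 0.0

    ids = set(semantic_hits) | set(anomaly_scores)
    return sorted(
        ids,
        key=lambda t: w_sem * semantic_hits.get(t, 0.0)
                      + w_anom * squash(anomaly_scores.get(t, 0.0)),
        reverse=True,
    )
```

A template that is only moderately similar to the query but is actively anomalous can outrank a closer semantic match — which is exactly the behavior an incident investigation wants.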
Reranker
Voyage Rerank 2.5 scores templates + sampled log bundles against the query. Cuts top 50 down to top 8 bundles the LLM will reason over.
Alternatives: Cohere Rerank v3, Jina Reranker v2
Root-cause Synthesizer
Claude Sonnet 4 — best causal reasoning over noisy evidence in 2026. Receives the query, top templates, anomaly signals, and a time-windowed trace bundle. Produces a hypothesis with evidence, confidence, and recommended next checks.
Alternatives: GPT-4o, Gemini 2.5 Pro, DeepSeek R1 (reasoning-heavy, cheaper)
Investigation UI
Slack bot + web UI. Renders the hypothesis, cited log lines (clickable to deep-link into Datadog/Loki), related traces, and ‘was this helpful’ feedback.
Alternatives: Slack workflow, Custom React, Datadog sidepanel
The stack
Vector.dev handles transforms, enrichment, and sampling in the pipeline (not at query time) — critical at 1TB/day. Kafka buffers between ingest and the structured store so spikes don’t cascade.
Alternatives: Fluent Bit, OpenTelemetry Collector
Drain3 is the de facto standard for log templating. Handles online updates as new templates appear, and gives you a stable template_id you can join on. LLM-based templating is accurate but 1000x slower — don’t use it in the hot path.
Alternatives: Spell, LogMine
ClickHouse crushes analytical queries over time-series log data and is cheap to self-host. Loki is simpler but slower for aggregations. Datadog is fine if you already pay — but structured-first RAG requires columnar speed for sub-second anomaly detection.
Alternatives: Grafana Loki, Datadog Logs, BigQuery
You embed templates (~10k-100k vectors total), not log lines. At that scale, the cheapest-per-token model is fine. Voyage-3 is ~3-5% better quality but the embedding budget is trivial here — pick whatever your platform already integrates with.
Alternatives: Voyage-3, BGE-M3
Small vector count (tens of thousands of templates), fast filters on service/severity. pgvector or even in-memory FAISS works. Managed Qdrant is the safe default.
Alternatives: pgvector, Pinecone
Per-template z-score on rate, latency, and error share is cheap, interpretable, and 80% as good as fancier models. Run every minute in ClickHouse. Graduate to Prophet or Isolation Forest only once you have proven the simple version covers you.
Alternatives: Prophet, Isolation Forest, Datadog Watchdog
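In production this runs as a scheduled ClickHouse query; the same per-template logic in plain Python, for reference:

```python
import math

def zscore(window, current):
    """Rolling z-score of the current per-minute value for one template_id
    against its trailing window. Rate, latency, and error share all use the
    same computation; flagging, e.g., |z| > 3 is a typical starting threshold."""
    n = len(window)
    mean = sum(window) / n
    var = sum((x - mean) ** 2 for x in window) / n
    std = math.sqrt(var)
    if std == 0:
        # Flat baseline: any deviation at all is maximally surprising.
        return 0.0 if current == mean else float("inf")
    return (current - mean) / std
```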
Claude Sonnet 4 does best on causal reasoning over noisy, incomplete evidence — exactly what root-cause is. DeepSeek R1 is a cheaper reasoning-tuned alternative if you are cost-sensitive and can tolerate higher latency.
Alternatives: GPT-4o, Gemini 2.5 Pro, DeepSeek R1
Cost at each scale
Prototype: 10 GB/day · 100 investigations/mo · $380/mo
Startup: 100 GB/day · 2k investigations/mo · $4,800/mo
Scale: 1 TB/day · 25k investigations/mo · $62,000/mo
Tradeoffs
Embed templates vs embed log lines
Embedding every log line at 1TB/day is both ludicrously expensive and useless — most lines are duplicates of a template. Embed the templates (tens of thousands) and let the structured store handle exact lookup. This one decision is what makes log RAG affordable.
Claude Sonnet 4 vs DeepSeek R1 for reasoning
Sonnet 4 is the default — best causal reasoning, clean citations, fast enough. DeepSeek R1 is a cheaper reasoning-specific alternative: ~5x lower cost per investigation but 2-3x higher latency. Use Sonnet for interactive Slack investigations, R1 for scheduled batch RCA digests.
Own the log store vs Datadog/Splunk
Self-hosting ClickHouse at 1TB/day costs ~$15-20k/mo all-in. The same volume on Datadog is $80-150k/mo. The self-host path pays back in under 2 months but demands real SRE capacity. Hybrid (hot in ClickHouse, archive in S3, dashboards still in Datadog) is usually the right 6-month compromise.
Failure modes & guardrails
New template explosion during deploys
Mitigation: Rate-limit new template creation per service. When Drain3 sees >50 new templates/minute from one service, pause template creation and alert — usually a log-format change or a bug printing unique stack traces. Auto-cluster the overflow post-hoc.
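A minimal sketch of that per-service rate limit, assuming minute-bucketed counters; the 50/minute threshold is the one suggested above, and everything else is illustrative.

```python
from collections import defaultdict

class TemplateRateLimiter:
    """Guardrail sketch: pause template creation for a service once it mints
    more than `limit` new templates in one minute. Overflow lines get
    clustered post-hoc instead of creating templates in the hot path."""

    def __init__(self, limit=50):
        self.limit = limit
        self.counts = defaultdict(int)  # (service, minute) -> new templates

    def allow_new_template(self, service: str, minute: int) -> bool:
        key = (service, minute)
        if self.counts[key] >= self.limit:
            return False  # paused: alert and defer to post-hoc clustering
        self.counts[key] += 1
        return True
```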
LLM hallucinates root causes from correlated but unrelated events
Mitigation: Force the model to cite specific log line timestamps and trace IDs. Post-generation, validate every cited trace_id actually exists in ClickHouse for that time window. Reject hypotheses with < 2 supporting evidence citations.
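A sketch of that post-generation check, with a plain set standing in for the ClickHouse trace lookup. Single-evidence hypotheses are demoted to 'candidate' rather than rejected outright.

```python
def validate_hypothesis(citations, existing_trace_ids, min_citations=2):
    """Post-generation guardrail: every cited trace_id must actually exist in
    the structured store for the investigated window, and a hypothesis needs
    at least `min_citations` surviving citations to be labeled a likely cause.
    `existing_trace_ids` stands in for a ClickHouse lookup."""
    valid = [c for c in citations if c["trace_id"] in existing_trace_ids]
    if len(valid) < min_citations:
        return "candidate", valid   # demoted, not presented as root cause
    return "likely_cause", valid
```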
Sensitive data (PII, secrets) leaks into the LLM
Mitigation: Scrub at the ingest layer with Vector.dev regex rules (email, phone, AWS keys, credit card). Tag services as ‘sensitive’ in metadata; for those services, route investigation to a self-hosted model (Llama 3.3 70B) instead of Claude/GPT.
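In the real pipeline these rules live in Vector.dev's transform layer; the Python below only sketches equivalent regexes. The patterns are illustrative rather than exhaustive (no Luhn check on card numbers, for instance).

```python
import re

# Ingest-time scrubbing sketch. Illustrative patterns only — a production
# ruleset needs far more coverage (GCP keys, bearer tokens, national IDs).
SCRUB_RULES = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "[AWS_KEY]"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD]"),
]

def scrub(line: str) -> str:
    for pattern, replacement in SCRUB_RULES:
        line = pattern.sub(replacement, line)
    return line
```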
Anomaly detector floods on seasonal traffic
Mitigation: Use rolling baselines that respect weekly seasonality (compare this Tuesday 03:14 to last 4 Tuesdays 03:14, not last 60 minutes). Couple with SLO burn rate — an anomaly that isn’t burning SLO is lower priority. Surface all of it but rank by impact, not novelty.
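Selecting the baseline is the whole trick. A sketch of the slot selection, assuming minute-resolution timestamps; 4 weeks matches the heuristic above.

```python
from datetime import datetime, timedelta

def seasonal_baseline_slots(ts: datetime, weeks: int = 4):
    """Return the timestamps to baseline against: the same minute on the
    same weekday for each of the previous `weeks` weeks, instead of the
    trailing 60 minutes. Weekday and time-of-day are preserved exactly."""
    return [ts - timedelta(weeks=w) for w in range(1, weeks + 1)]
```

The z-score is then computed against the values observed at these slots rather than the last hour, so a load pattern that repeats every Tuesday night stops looking anomalous.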
ClickHouse query cost explodes on open-ended questions
Mitigation: LLM-generated SQL is dangerous at 1TB/day. Use a prepared-query templating layer: the LLM picks one of ~20 parameterized ClickHouse query templates (ratio, p99, error_share, template_rate_window), never writes raw SQL. Cap query cost at the query budget gateway.
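A sketch of the prepared-query layer. The two SQL bodies are illustrative, but the `{name:Type}` placeholders use ClickHouse's real server-side query-parameter syntax, so values are bound by the server, never string-interpolated.

```python
# The LLM chooses a template name and fills whitelisted parameters;
# it never emits raw SQL. Table and column names here are illustrative.
QUERY_TEMPLATES = {
    "template_rate_window": (
        "SELECT toStartOfMinute(ts) AS minute, count() AS n "
        "FROM logs WHERE template_id = {template_id:UInt64} "
        "AND service = {service:String} "
        "AND ts BETWEEN {t1:DateTime} AND {t2:DateTime} "
        "GROUP BY minute ORDER BY minute"
    ),
    "error_share": (
        "SELECT countIf(level = 'error') / count() AS share "
        "FROM logs WHERE service = {service:String} "
        "AND ts BETWEEN {t1:DateTime} AND {t2:DateTime}"
    ),
}

ALLOWED_PARAMS = {"template_id", "service", "t1", "t2"}

def pick_query(name: str, params: dict):
    """Validate the LLM's choice: known template, whitelisted params only."""
    if name not in QUERY_TEMPLATES:
        raise ValueError(f"unknown query template: {name}")
    if not set(params) <= ALLOWED_PARAMS:
        raise ValueError(f"disallowed params: {set(params) - ALLOWED_PARAMS}")
    return QUERY_TEMPLATES[name], params  # executed with server-side binding
```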
Frequently asked questions
Should I embed every log line?
No, and this is the single most important decision. At 1TB/day you would spend more on embeddings than on the rest of the stack combined, and the signal-to-noise is terrible. Instead, cluster log lines into templates with Drain3, embed the templates (tens of thousands), and let a columnar store (ClickHouse) handle exact lookup and aggregation.
Which LLM is best for root-cause analysis in 2026?
Claude Sonnet 4 is the default — strongest causal reasoning over noisy evidence and clean citations. GPT-4o is close behind. DeepSeek R1 is a cheaper reasoning-tuned alternative, ~5x cheaper per investigation but 2-3x higher latency. Gemini 2.5 Pro’s 2M context helps when you need to reason across huge trace bundles.
ClickHouse or Loki or Datadog?
ClickHouse if you want columnar speed for anomaly detection and can staff it. Loki for simpler teams on the Grafana stack. Datadog if you are already on it and cost isn’t the primary driver. The architecture is the same — the LLM doesn’t care where logs live as long as you can run structured queries in <1s.
How does the system avoid hallucinating root causes?
Two guardrails. First, force the model to cite specific (service, timestamp, trace_id) tuples for every claim, and validate post-generation that those traces actually exist. Second, require at least two supporting evidence citations per hypothesis — single-evidence hypotheses become ‘candidate’, not ‘likely cause’.
How do I handle PII and secrets in logs?
Scrub at the ingest layer with Vector.dev regex rules (email, phone, AWS/GCP keys, CC patterns). Tag services that handle sensitive data; for those, route LLM calls to a self-hosted model (Llama 3.3 70B on vLLM) so raw content never leaves your network. Claude and GPT handle the rest.
Can the LLM write SQL against ClickHouse?
Be careful. Free-form LLM SQL at 1TB/day can burn thousands of dollars in a single bad query. Use a prepared-query layer: ~20 parameterized ClickHouse query templates (ratio, p99, error_share, template_rate_window) and let the LLM only pick parameters. Cap cost at the gateway.
How much does this cost at 1TB/day?
Self-hosted, budget $60-75k/mo all-in — ClickHouse storage dominates, then compute and LLM. Same volume on Datadog’s managed logs tier is $80-150k/mo just for ingestion. The self-host path pays back in weeks but demands real SRE capacity — don’t underestimate the ops load on ClickHouse.
How do I prevent anomaly-detector alert fatigue?
Use rolling baselines that respect weekly seasonality (this Tuesday 03:14 vs last 4 Tuesdays 03:14). Rank anomalies by SLO burn rate, not novelty. Suppress anomalies in the first 5 minutes of a deploy and for 10 minutes after. An alert that isn’t burning SLO is informational, not actionable.