Reference Architecture · RAG

Real-time News RAG

Last updated: April 16, 2026

Quick answer

Stream ingestion through Kafka or Redpanda, parse with Trafilatura, dedupe with MinHash/SimHash, embed with OpenAI text-embedding-3-large (cheapest at this volume), store in Turbopuffer or Pinecone with a freshness-decay score, retrieve with a hybrid BM25 + dense query, and synthesize with Gemini 2.0 Flash or Claude Haiku 4 — both fast and cheap enough for minute-cadence news loops. Expect $0.02 to $0.08 per answer. The hardest problem isn’t retrieval, it’s deduplication and source credibility weighting.

The problem

You are building a news copilot, a financial research tool, or a social monitoring product. Articles are published every second across thousands of RSS feeds, APIs (Bloomberg, Reuters, NewsAPI), and social sources. A user asks ‘what happened with the Fed rate decision in the last hour’ — the system must already have that content indexed, rank by freshness AND relevance, dedupe near-identical wire stories, and synthesize a cited answer before the next news cycle.

Architecture

Pipeline (diagram): Streaming Ingester → Article Extractor → Deduplicator → Embedding Model → Freshness-aware Vector DB → Freshness-weighted Retriever → Reranker → Answer Synthesizer → News UI → answer + cited sources

Streaming Ingester

Polls RSS, hits news APIs (NewsAPI, GDELT, Bloomberg, Reuters), and consumes social firehoses (X, Bluesky, Reddit). Normalizes to a common article schema and pushes to Kafka/Redpanda for downstream consumers.

Alternatives: Kafka, AWS Kinesis, Diffbot News API, SerpAPI
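A minimal sketch of the normalization step: map a raw feed item onto a common article schema before producing it to Kafka/Redpanda. Field names here are illustrative, not a standard; adapt them to your downstream consumers.

```python
# Normalize heterogeneous feed items into one article schema before
# pushing to the streaming log. Schema fields are an assumption.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class Article:
    url: str
    title: str
    body: str
    source: str
    published_at: str  # ISO 8601, always UTC

def normalize_rss(item: dict, source: str) -> Article:
    """Map a raw RSS entry onto the common schema."""
    ts = datetime.fromtimestamp(item["published_epoch"], tz=timezone.utc)
    return Article(
        url=item["link"],
        title=item["title"].strip(),
        body=item.get("summary", ""),
        source=source,
        published_at=ts.isoformat(),
    )

raw = {"link": "https://example.com/fed", "title": " Fed holds rates ",
       "published_epoch": 1713200000, "summary": "The Fed held rates steady."}
article = normalize_rss(raw, source="example-feed")
payload = json.dumps(asdict(article))  # what you'd produce to the topic
```

The JSON payload is what lands on the Kafka/Redpanda topic; keeping every source normalized to one schema is what makes the downstream extractor, deduplicator, and embedder source-agnostic.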

Article Extractor

Fetches the canonical URL, strips nav/footer/ads, extracts title, author, publish time, body, and images. Trafilatura handles 95% of modern news layouts.

Alternatives: Newspaper3k, Mercury Parser, Diffbot Extract

Deduplicator

MinHash + LSH (datasketch) or SimHash on normalized body text. Wire stories (AP/Reuters) get republished by hundreds of outlets — without dedup, retrieval returns the same content 50 times.

Alternatives: SimHash, Exact-hash on title+first-paragraph, Embedding cosine dedup
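To make the idea concrete, here is a from-scratch sketch of MinHash similarity estimation on word shingles. In production you would use datasketch (MinHash + MinHashLSH); this toy version only illustrates why near-duplicate wire stories collide while unrelated articles do not.

```python
# From-scratch MinHash sketch for near-duplicate detection.
# Production systems should use datasketch; NUM_PERM and k are illustrative.
import hashlib

NUM_PERM = 64  # number of hash "permutations"; datasketch defaults to 128

def shingles(text: str, k: int = 5) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(text: str) -> list:
    sig = []
    for seed in range(NUM_PERM):
        # min over seeded hashes of all shingles approximates a permutation
        best = min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        )
        sig.append(best)
    return sig

def similarity(a: list, b: list) -> float:
    """Fraction of matching signature slots ≈ Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / NUM_PERM

wire = "The Federal Reserve held interest rates steady on Wednesday citing inflation"
copy = "The Federal Reserve held interest rates steady on Wednesday citing inflation data"
other = "Local team wins championship game in dramatic overtime finish last night"

sim_dup = similarity(minhash(wire), minhash(copy))
sim_diff = similarity(minhash(wire), minhash(other))
```

The republished wire copy scores far above any sensible dedup threshold (the 0.85 used later in this document), while the unrelated story scores near zero.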

Embedding Model

OpenAI text-embedding-3-large is the cheapest-per-token model that still ranks competitively — matters at 100k+ articles/day.

Alternatives: Voyage-3 (better quality, higher cost), Cohere Embed v3, BGE-M3 (self-hosted)

Freshness-aware Vector DB

Turbopuffer or Pinecone with metadata: publish_time, source, source_credibility_score, topic. Retrieval re-ranks with a time-decay function so the last hour beats last week.

Alternatives: Pinecone Serverless, Qdrant, Weaviate
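An illustrative record shape for the freshness-aware index. The attribute names follow the text above; the actual upsert call depends on your client library (Turbopuffer, Pinecone, etc.), so only the payload is sketched here.

```python
# Illustrative index record with the metadata the retriever needs:
# publish_time, source, source_credibility_score, topic. Shape is an
# assumption, not any specific vector DB's API.
from datetime import datetime, timezone

def make_record(article_id, vector, publish_time, source, credibility, topic):
    return {
        "id": article_id,
        "vector": vector,
        "attributes": {
            "publish_time": publish_time.isoformat(),
            "source": source,
            "source_credibility_score": credibility,
            "topic": topic,
        },
    }

rec = make_record(
    "art-123", [0.1, 0.2, 0.3],
    datetime(2026, 4, 16, 14, 30, tzinfo=timezone.utc),
    "reuters.com", 0.95, "monetary-policy",
)
```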

Freshness-weighted Retriever

Hybrid BM25 + dense. Final score = similarity × exp(-age_hours / tau) × source_credibility. tau is topic-dependent: breaking news ~2h, deep research ~168h.

Alternatives: Time-filtered retrieval, Weighted merging, Separate hot/cold indexes
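The scoring formula above can be sketched directly; tau values here mirror the topic-dependent defaults in the text.

```python
# final = similarity * exp(-age_hours / tau) * source_credibility
# Tau values follow the text: ~2h for breaking news, ~168h for research.
import math

def final_score(similarity, age_hours, credibility, tau_hours):
    return similarity * math.exp(-age_hours / tau_hours) * credibility

TAU = {"breaking": 2.0, "research": 168.0}

# A 30-minute-old article vs a 3-day-old one, breaking-news query:
fresh = final_score(0.80, 0.5, 0.9, TAU["breaking"])
stale = final_score(0.95, 72.0, 0.9, TAU["breaking"])
```

With a 2-hour tau, the 3-day-old article decays to effectively zero even with higher raw similarity; under the 168-hour research tau the same article would still rank.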

Reranker

Cohere Rerank v3 cuts the top 50 to top 8. Cheap and fast enough for minute-cadence pipelines.

Alternatives: Voyage Rerank 2.5, Jina Reranker v2

Answer Synthesizer

Gemini 2.0 Flash or Claude Haiku 4. Both stream fast and cost pennies per answer. Prompted to cite (publisher, headline, timestamp) for each claim.

Alternatives: claude-haiku-4, gpt-4o-mini, llama-3.3-on-groq
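A hedged sketch of assembling the synthesis prompt: each retrieved article is numbered so the model can cite (publisher, headline, timestamp) per claim. The prompt wording is illustrative, not a tested template.

```python
# Build a numbered-context prompt so every claim can carry a [n] citation
# with publisher, headline, and timestamp. Wording is an assumption.
def build_prompt(question, articles):
    context = "\n".join(
        f"[{i + 1}] {a['publisher']} — \"{a['headline']}\" ({a['published_at']})\n{a['body']}"
        for i, a in enumerate(articles)
    )
    return (
        "Answer using only the sources below. After every claim, cite the "
        "source as [n] with publisher, headline, and timestamp.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

prompt = build_prompt(
    "What happened with the Fed rate decision?",
    [{"publisher": "Reuters", "headline": "Fed holds rates",
      "published_at": "2026-04-16T14:02Z", "body": "The Fed held rates steady."}],
)
```

The same `[n]` indices map directly to the UI's citation chips, so the frontend never has to re-match claims to sources.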

News UI

Streams the answer with live citations (publisher logo, headline, time-ago). Clicking a citation opens the source. Freshness indicator: ‘answer based on articles from the last 47 minutes’.

Alternatives: Slack bot, Email digest, Custom React

The stack

Streaming ingest: Redpanda

Redpanda is Kafka-compatible with simpler ops and lower latency. At news volumes (100k-1M articles/day) you want a durable log — not a queue — so replays and backfills are cheap.

Alternatives: Kafka, AWS Kinesis, Google Pub/Sub

Article extraction: Trafilatura

Open-source, handles 95% of modern news layouts, and actively maintained. Diffbot is better quality but at news-scale volumes the cost adds up. Use Trafilatura as default and Diffbot only for high-value sources.

Alternatives: Newspaper3k, Diffbot Extract

Deduplication: MinHash + LSH (datasketch)

Wire stories from AP/Reuters get republished across hundreds of outlets — without dedup, a query for ‘Fed rate decision’ returns 50 copies of the same content. MinHash catches near-duplicates with edited headlines/intros. Runs in milliseconds at ingest.

Alternatives: SimHash, Embedding cosine dedup

Embeddings: OpenAI text-embedding-3-large

Cheapest per token at the volume news RAG demands (100k-1M embeddings/day). Voyage-3 is ~3-5% higher quality but 2x the cost. BGE-M3 is free to self-host if you already have GPU capacity.

Alternatives: Voyage-3, BGE-M3 self-hosted

Vector DB: Turbopuffer

Turbopuffer is built for exactly this workload — high-ingest, time-partitioned, cheap storage. At 1M+ articles/day retained for 30 days, it’s 5-10x cheaper than Pinecone. Pinecone Serverless is the safer managed default if storage cost isn’t an issue yet.

Alternatives: Pinecone Serverless, Qdrant

Freshness scoring: Exponential decay in rerank step

Don’t filter — decay. Hard filters (‘only last 24h’) miss context; exponential decay `score × exp(-age / tau)` naturally prefers fresh while still surfacing the seminal story when it’s the best match.

Alternatives: Hard time filters, Multi-stage retrieval (hot + cold)

Answer LLM: Gemini 2.0 Flash or Claude Haiku 4

News answers need to stream in under 1 second. Both Gemini 2.0 Flash and Haiku 4 deliver this at sub-penny cost per answer. Groq-hosted Llama 3.3 is faster still (500 tok/s) if you are latency-obsessed.

Alternatives: gpt-4o-mini, llama-3.3-on-groq

Cost at each scale

Prototype

10k articles/day · 5k queries/mo

$320/mo

RSS/API ingestion infra: $50
Article extraction compute: $30
Embeddings (300k articles/mo): $20
Query embeddings (5k): $1
Cohere Rerank v3 (5k): $5
Gemini 2.0 Flash answers (5k × ~4k tok): $6
Turbopuffer starter: $50
Redpanda serverless: $79
Hosting + observability: $79

Startup

200k articles/day · 100k queries/mo

$4,800/mo

Ingestion + NewsAPI/Diffbot subscriptions: $900
Extraction compute (Trafilatura workers): $300
Embeddings (6M articles/mo): $400
Query embeddings (100k): $8
Cohere Rerank v3 (100k): $100
Gemini 2.0 Flash + Haiku mix: $220
Turbopuffer (30-day retention, ~180M vectors): $900
Redpanda standard: $500
Dedup + credibility service + observability: $800
Hosting + SRE: $672

Scale

1M articles/day · 2M queries/mo

$42,000/mo

Premium feeds (Bloomberg, Reuters, LSEG): $8,000
Ingestion + extraction (GPU workers): $3,500
Embeddings (30M/mo): $2,000
Query embeddings (2M): $160
Cohere Rerank v3 (2M): $2,000
Answer LLMs (mix of Flash/Haiku/Sonnet): $4,500
Turbopuffer self-hosted + S3: $6,000
Redpanda Cloud Pro: $3,500
Dedup + credibility + topic classifier: $4,000
SRE + observability + hosting: $8,340

Latency budget

Total P50: 31,590ms · Total P95: 93,120ms

Ingest → searchable (async, not user-facing): 30,000ms p50 · 90,000ms p95
Query embedding: 70ms p50 · 160ms p95
Hybrid retrieval + freshness weighting: 120ms p50 · 280ms p95
Rerank to top-8: 150ms p50 · 280ms p95
LLM answer (first token): 350ms p50 · 700ms p95
LLM answer (full stream): 900ms p50 · 1,700ms p95

Tradeoffs

Hard time filter vs exponential decay

A hard ‘last 24 hours’ filter is simple but misses the seminal article from 3 days ago when the user asks a follow-up. Exponential decay (`score × exp(-age / tau)`) keeps fresh content ranked higher while still surfacing canonical context. Default to decay; use hard filters only when the user explicitly says ‘today only’.

Cheap LLM (Flash/Haiku) vs smart LLM (Sonnet 4/GPT-4o)

For news summarization and simple Q&A, Gemini 2.0 Flash and Claude Haiku 4 are indistinguishable from premium models at 5-10x lower cost. Reserve Sonnet 4 or GPT-4o for multi-story synthesis, financial analysis, and timeline reconstruction — queries where model quality materially changes the answer.

Own the pipeline vs use Diffbot/Exa

Managed news APIs (Diffbot, Exa, SerpAPI) get you to working in a day, cost $500-2k/mo at prototype scale, and hide the dedup/credibility problem. Own the pipeline (RSS + Trafilatura + dedup) and you pay ~$200/mo in compute but ship weeks later. Inflection point is around 100k+ articles/day or when source credibility becomes a product feature.

Failure modes & guardrails

Wire stories dominate results (same content 50 times)

Mitigation: MinHash + LSH dedup at ingest with a similarity threshold of 0.85. Cluster near-duplicates and promote the earliest-published canonical version. Retrieval returns at most one article per cluster. Tune threshold weekly against a labeled eval set.
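The cluster-then-promote step above can be sketched as a simple pass over ingested articles: keep only the earliest-published member of each near-duplicate cluster.

```python
# Group near-duplicates by cluster and keep the earliest-published member
# as the canonical article. Field names are illustrative.
def canonical_per_cluster(articles):
    """articles: dicts with 'cluster_id' and ISO-8601 UTC 'published_at'."""
    clusters = {}
    for a in articles:
        cur = clusters.get(a["cluster_id"])
        # ISO-8601 strings in the same zone compare chronologically
        if cur is None or a["published_at"] < cur["published_at"]:
            clusters[a["cluster_id"]] = a
    return list(clusters.values())

batch = [
    {"id": "ap-1", "cluster_id": "fed", "published_at": "2026-04-16T14:00:00Z"},
    {"id": "outlet-77", "cluster_id": "fed", "published_at": "2026-04-16T14:20:00Z"},
    {"id": "local-9", "cluster_id": "sports", "published_at": "2026-04-16T13:00:00Z"},
]
kept = canonical_per_cluster(batch)
```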

Low-credibility sources pollute answers

Mitigation: Maintain a per-source credibility score (manual curation or third-party scoring like NewsGuard/Ad Fontes). Multiply retrieval score by source_credibility. For health, finance, or political topics, hard-filter to sources above a threshold. Surface credibility in the UI.
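A sketch of the weighting plus hard floor described above. The threshold and the list of sensitive topics are illustrative assumptions.

```python
# Multiply retrieval score by source credibility; hard-filter below a
# floor for sensitive topics. Threshold and topic set are assumptions.
SENSITIVE_TOPICS = {"health", "finance", "politics"}
MIN_CREDIBILITY = 0.7

def weight_and_filter(results, topic):
    out = []
    for r in results:
        if topic in SENSITIVE_TOPICS and r["credibility"] < MIN_CREDIBILITY:
            continue  # drop low-credibility sources on sensitive topics
        out.append({**r, "score": r["score"] * r["credibility"]})
    return sorted(out, key=lambda r: r["score"], reverse=True)

hits = [
    {"id": "a", "score": 0.9, "credibility": 0.5},
    {"id": "b", "score": 0.8, "credibility": 0.95},
]
ranked = weight_and_filter(hits, topic="finance")
```

On a finance query the 0.5-credibility source is dropped outright; on a non-sensitive topic it would merely rank below the credible one after weighting.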

Breaking-news cascade fills the index with one story

Mitigation: Per-topic rate limiting at ingest: a topic cluster that spikes from 0 to 500 articles/hour gets capped (keep the first 50, drop the rest). Combined with dedup this prevents a single event from drowning the vector DB — and it protects the freshness signal for adjacent topics.
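The per-topic cap can be sketched as a counter keyed on (topic, hour bucket); the cap value here is illustrative.

```python
# Per-topic ingest rate limiting: admit at most N articles per topic per
# hour window. Cap and windowing scheme are assumptions.
from collections import defaultdict

class TopicRateLimiter:
    def __init__(self, cap_per_hour=50):
        self.cap = cap_per_hour
        self.counts = defaultdict(int)  # (topic, hour_bucket) -> count

    def admit(self, topic, ts_epoch):
        bucket = (topic, int(ts_epoch // 3600))
        if self.counts[bucket] >= self.cap:
            return False  # topic already hit its hourly cap; drop article
        self.counts[bucket] += 1
        return True

limiter = TopicRateLimiter(cap_per_hour=2)
admitted = [limiter.admit("fed-decision", 1000 + i) for i in range(4)]
```

Articles beyond the cap are dropped before embedding, so a spiking story cannot drown the index or starve adjacent topics of freshness signal.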

Index grows unbounded and cost explodes

Mitigation: Time-partition the vector DB (Turbopuffer does this natively). Retain 30-90 days hot, move the rest to cold S3 with on-demand rehydration. Most news queries care about the last week; long-tail queries are rare enough that cold retrieval is acceptable.
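A minimal sketch of the hot/cold routing rule, assuming a 90-day hot window as the text suggests.

```python
# Route articles to hot (vector DB) or cold (S3) storage by age.
# The 90-day cutoff is illustrative; make it a per-customer policy.
from datetime import datetime, timedelta, timezone

HOT_RETENTION = timedelta(days=90)

def storage_tier(publish_time, now):
    return "hot" if now - publish_time <= HOT_RETENTION else "cold"

now = datetime(2026, 4, 16, tzinfo=timezone.utc)
recent = storage_tier(datetime(2026, 4, 1, tzinfo=timezone.utc), now)
old = storage_tier(datetime(2025, 11, 1, tzinfo=timezone.utc), now)
```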

Stale answers because ingestion is lagged

Mitigation: Track ingest-to-searchable latency as a first-class SLO (target < 60s p95). Expose it in the UI (‘answer based on articles from the last N minutes’). Alert on any lag above 3 minutes — that is the threshold at which users feel the tool is out of date.

Frequently asked questions

How fresh can RAG answers be?

With streaming ingest (Redpanda/Kafka), Trafilatura extraction, and a freshness-aware vector DB like Turbopuffer, ingest-to-searchable lands in 30-90 seconds p50. That means a news event published a minute ago is already retrievable. Past ~30 seconds, users generally cannot tell the difference, so 60s p95 is a good SLO.

Which vector DB is best for high-volume streaming ingest?

Turbopuffer is built for this workload — high ingest, time-partitioned, cheap cold storage. At 1M+ articles/day with 30-day retention it’s 5-10x cheaper than Pinecone. Pinecone Serverless is the safer managed default if cost isn’t yet dominant. Qdrant works but you own the scaling.

How do I deduplicate wire stories?

MinHash + LSH (via datasketch in Python) with a Jaccard similarity threshold around 0.85 catches >95% of wire republications. Runs in milliseconds at ingest. Cluster near-duplicates, promote the earliest-published canonical, suppress the rest from retrieval. SimHash is a cheaper alternative if you are memory-constrained.

Which LLM should I use for news answers?

Gemini 2.0 Flash and Claude Haiku 4 are the 2026 defaults — both stream fast (>100 tok/s) and cost under a penny per answer. Groq-hosted Llama 3.3 runs at 500 tok/s if you are latency-obsessed. Reserve Sonnet 4 or GPT-4o for multi-story synthesis where model quality actually changes the answer.

How do I rank by freshness without losing relevance?

Don’t filter, decay. Compute `final_score = similarity × exp(-age_hours / tau)` where tau is topic-dependent (~2h for breaking news, ~168h for deep research). Multiply by source_credibility. A 30-minute-old article from a credible source beats a 3-day-old one; a seminal piece still surfaces when nothing fresher matches.

How do I handle source credibility?

Maintain a per-source score 0.0-1.0 (manual curation, NewsGuard, Ad Fontes, or a hybrid). Multiply retrieval score by credibility. For health/finance/political topics, hard-filter below a threshold (e.g. 0.7). Always surface the credibility score in the UI so users can judge the answer themselves.

Can I use this for financial news and market data?

Yes, and it’s one of the highest-ROI use cases. Couple the news pipeline with a structured market-data source (Polygon, Alpaca) so the answer can cite ‘stock moved X% after this headline at Y:ZZ’. Premium feeds (Bloomberg B-Pipe, Reuters) add latency-sensitive coverage but cost $5-10k/mo per seat.

How do I stop the index from growing forever?

Time-partition the vector DB. Keep 30-90 days hot in Turbopuffer/Pinecone, move the rest to cold S3 with on-demand rehydration. Most news queries are recent; long-tail historical queries can accept a 2-3 second cold retrieval penalty. Retention policy is a per-customer decision but 90 days is a sane default.
