Reference Architecture · rag
Real-time News RAG
Last updated: April 16, 2026
Quick answer
Stream ingestion through Kafka or Redpanda, parse with Trafilatura, dedupe with MinHash/SimHash, embed with OpenAI text-embedding-3-large (cheapest at this volume), store in Turbopuffer or Pinecone with a freshness-decay score, retrieve with a hybrid BM25 + dense query, and synthesize with Gemini 2.0 Flash or Claude Haiku 4 — both fast and cheap enough for minute-cadence news loops. Expect $0.02 to $0.08 per answer. The hardest problem isn’t retrieval, it’s deduplication and source credibility weighting.
The problem
You are building a news copilot, a financial research tool, or a social monitoring product. Articles are published every second across thousands of RSS feeds, APIs (Bloomberg, Reuters, NewsAPI), and social sources. A user asks ‘what happened with the Fed rate decision in the last hour’ — the system must already have that content indexed, rank by freshness AND relevance, dedupe near-identical wire stories, and synthesize a cited answer before the next news cycle.
Architecture
Streaming Ingester
Polls RSS, hits news APIs (NewsAPI, GDELT, Bloomberg, Reuters), and consumes social firehoses (X, Bluesky, Reddit). Normalizes to a common article schema and pushes to Kafka/Redpanda for downstream consumers.
Alternatives: Kafka, AWS Kinesis, Diffbot News API, SerpAPI
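The "common article schema" is the contract every downstream consumer depends on. A minimal sketch, assuming RSS items parsed into dicts (e.g. by feedparser) — the field names here are illustrative, not a fixed standard:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class Article:
    """Common schema every source (RSS, NewsAPI, GDELT, social) normalizes into."""
    url: str
    title: str
    source: str
    published_at: str  # ISO-8601 UTC
    body: str = ""

def normalize_rss_item(item: dict, source: str) -> Article:
    """Map one parsed RSS item onto the common schema before pushing to Kafka/Redpanda."""
    ts = item.get("published") or datetime.now(timezone.utc).isoformat()
    return Article(
        url=item["link"],
        title=item.get("title", "").strip(),
        source=source,
        published_at=ts,
        body=item.get("summary", ""),
    )

a = normalize_rss_item(
    {"link": "https://example.com/fed", "title": " Fed holds rates ",
     "published": "2026-04-16T14:02:00+00:00"},
    source="example.com",
)
```

Normalizing at the edge means the extractor, deduplicator, and embedder never have to care which of the thousands of feeds an article came from.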
Article Extractor
Fetches the canonical URL, strips nav/footer/ads, extracts title, author, publish time, body, and images. Trafilatura handles 95% of modern news layouts.
Alternatives: Newspaper3k, Mercury Parser, Diffbot Extract
Deduplicator
MinHash + LSH (datasketch) or SimHash on normalized body text. Wire stories (AP/Reuters) get republished by hundreds of outlets — without dedup, retrieval returns the same content 50 times.
Alternatives: SimHash, Exact-hash on title+first-paragraph, Embedding cosine dedup
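In production you would use datasketch's MinHash/MinHashLSH; a minimal stdlib sketch of the same idea (seeded hash functions over word shingles, signature agreement estimating Jaccard similarity) shows why edited wire copies still match:

```python
import hashlib

def shingles(text: str, k: int = 5) -> set:
    """Word k-shingles of normalized body text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(sh: set, num_perm: int = 64) -> list:
    """One min value per seeded hash function; datasketch does this faster with numpy."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "big")
            for s in sh
        ))
    return sig

def est_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of agreeing signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

wire = "The Federal Reserve held interest rates steady on Wednesday citing cooling inflation"
edited = "Federal Reserve held interest rates steady on Wednesday citing cooling inflation data analysts said"
```

A republished wire story with a trimmed intro scores high against the original; an unrelated article scores near zero — which is exactly the split the 0.85 threshold exploits.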
Embedding Model
OpenAI text-embedding-3-large is the cheapest-per-token model that still ranks competitively — matters at 100k+ articles/day.
Alternatives: Voyage-3 (better quality, higher cost), Cohere Embed v3, BGE-M3 (self-hosted)
Freshness-aware Vector DB
Turbopuffer or Pinecone with metadata: publish_time, source, source_credibility_score, topic. Retrieval re-ranks with a time-decay function so the last hour beats last week.
Alternatives: Pinecone Serverless, Qdrant, Weaviate
Freshness-weighted Retriever
Hybrid BM25 + dense. Final score = similarity × exp(-age_hours / tau) × source_credibility. tau is topic-dependent: breaking news ~2h, deep research ~168h.
Alternatives: Time-filtered retrieval, Weighted merging, Separate hot/cold indexes
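The scoring formula above is small enough to show whole; the similarity and credibility values below are illustrative:

```python
import math

def final_score(similarity: float, age_hours: float, credibility: float,
                tau_hours: float = 2.0) -> float:
    """similarity × exp(-age_hours / tau) × source_credibility.
    tau ~2h for breaking news, ~168h for deep research."""
    return similarity * math.exp(-age_hours / tau_hours) * credibility

# A 30-minute-old wire story vs. a 6-hour-old in-depth piece, breaking-news tau:
fresh = final_score(similarity=0.78, age_hours=0.5, credibility=0.9)
stale = final_score(similarity=0.92, age_hours=6.0, credibility=0.9)
```

With the breaking-news tau the fresher article wins despite lower raw similarity; rerun the same pair with `tau_hours=168.0` and the more relevant piece wins instead — which is why tau must be topic-dependent.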
Reranker
Cohere Rerank v3 cuts the top 50 to top 8. Cheap and fast enough for minute-cadence pipelines.
Alternatives: Voyage Rerank 2.5, Jina Reranker v2
Answer Synthesizer
Gemini 2.0 Flash or Claude Haiku 4. Both stream fast and cost pennies per answer. Prompted to cite (publisher, headline, timestamp) for each claim.
Alternatives: gpt-4o-mini, llama-3.3-on-groq
News UI
Streams the answer with live citations (publisher logo, headline, time-ago). Clicking a citation opens the source. Freshness indicator: ‘answer based on articles from the last 47 minutes’.
Alternatives: Slack bot, Email digest, Custom React
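The freshness indicator is just the age of the oldest cited article; a minimal sketch (the label wording follows the example above):

```python
from datetime import datetime, timedelta, timezone

def freshness_label(citation_times: list, now: datetime) -> str:
    """UI freshness indicator: age of the oldest article cited in the answer."""
    oldest = min(citation_times)
    minutes = max(1, int((now - oldest).total_seconds() // 60))
    return f"answer based on articles from the last {minutes} minutes"

now = datetime(2026, 4, 16, 15, 0, tzinfo=timezone.utc)
label = freshness_label([now - timedelta(minutes=47), now - timedelta(minutes=12)], now)
```

Keying the label to the *oldest* citation is deliberately conservative: the claim "last 47 minutes" must hold for every source in the answer, not just the newest.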
The stack
Redpanda is Kafka-compatible with simpler ops and lower latency. At news volumes (100k-1M articles/day) you want a durable log — not a queue — so replays and backfills are cheap.
Alternatives: Kafka, AWS Kinesis, Google Pub/Sub
Open-source, handles 95% of modern news layouts, and actively maintained. Diffbot is better quality but at news-scale volumes the cost adds up. Use Trafilatura as default and Diffbot only for high-value sources.
Alternatives: Newspaper3k, Diffbot Extract
Wire stories from AP/Reuters get republished across hundreds of outlets — without dedup, a query for ‘Fed rate decision’ returns 50 copies of the same content. MinHash catches near-duplicates with edited headlines/intros. Runs in milliseconds at ingest.
Alternatives: SimHash, Embedding cosine dedup
Cheapest per token at the volume news RAG demands (100k-1M embeddings/day). Voyage-3 is ~3-5% higher quality but 2x the cost. BGE-M3 is free to self-host if you already have GPU capacity.
Alternatives: Voyage-3, BGE-M3 self-hosted
Turbopuffer is built for exactly this workload — high-ingest, time-partitioned, cheap storage. At 1M+ articles/day retained for 30 days, it’s 5-10x cheaper than Pinecone. Pinecone Serverless is the safer managed default if storage cost isn’t an issue yet.
Alternatives: Pinecone Serverless, Qdrant
Don’t filter — decay. Hard filters (‘only last 24h’) miss context; exponential decay `score × exp(-age / tau)` naturally prefers fresh while still surfacing the seminal story when it’s the best match.
Alternatives: Hard time filters, Multi-stage retrieval (hot + cold)
News answers need to stream in under 1 second. Both Gemini 2.0 Flash and Haiku 4 deliver this at sub-penny cost per answer. Groq-hosted Llama 3.3 is faster still (500 tok/s) if you are latency-obsessed.
Alternatives: gpt-4o-mini, llama-3.3-on-groq
Cost at each scale
Prototype
10k articles/day · 5k queries/mo
$320/mo
Startup
200k articles/day · 100k queries/mo
$4,800/mo
Scale
1M articles/day · 2M queries/mo
$42,000/mo
Latency budget
Tradeoffs
Hard time filter vs exponential decay
A hard ‘last 24 hours’ filter is simple but misses the seminal article from 3 days ago when the user asks a follow-up. Exponential decay (`score × exp(-age / tau)`) keeps fresh content ranked higher while still surfacing canonical context. Default to decay; use hard filters only when the user explicitly says ‘today only’.
Cheap LLM (Flash/Haiku) vs smart LLM (Sonnet 4/GPT-4o)
For news summarization and simple Q&A, Gemini 2.0 Flash and Claude Haiku 4 are indistinguishable from premium models at 5-10x lower cost. Reserve Sonnet 4 or GPT-4o for multi-story synthesis, financial analysis, and timeline reconstruction — queries where model quality materially changes the answer.
Own the pipeline vs use Diffbot/Exa
Managed news APIs (Diffbot, Exa, SerpAPI) get you to a working system in a day, cost $500-2k/mo at prototype scale, and hide the dedup/credibility problem. Own the pipeline (RSS + Trafilatura + dedup) and you pay ~$200/mo in compute but ship weeks later. The inflection point is around 100k+ articles/day, or when source credibility becomes a product feature.
Failure modes & guardrails
Wire stories dominate results (same content 50 times)
Mitigation: MinHash + LSH dedup at ingest with a similarity threshold of 0.85. Cluster near-duplicates and promote the earliest-published canonical version. Retrieval returns at most one article per cluster. Tune threshold weekly against a labeled eval set.
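Cluster-and-promote can be sketched with a small union-find over the near-duplicate pairs the LSH stage emits; the article dicts here are illustrative:

```python
def cluster_duplicates(articles: list, pairs: list) -> list:
    """Union near-duplicate pairs (LSH hits above the 0.85 threshold), then keep
    one canonical article per cluster: the earliest-published."""
    parent = {a["id"]: a["id"] for a in articles}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for a in articles:
        clusters.setdefault(find(a["id"]), []).append(a)
    # ISO-8601 timestamps sort lexicographically, so min() is earliest-published
    return [min(members, key=lambda m: m["published_at"]) for members in clusters.values()]

articles = [
    {"id": "a", "published_at": "2026-04-16T14:00Z", "source": "AP"},
    {"id": "b", "published_at": "2026-04-16T14:07Z", "source": "Outlet1"},
    {"id": "c", "published_at": "2026-04-16T14:21Z", "source": "Outlet2"},
    {"id": "d", "published_at": "2026-04-16T13:55Z", "source": "FT"},
]
canonical = cluster_duplicates(articles, pairs=[("a", "b"), ("b", "c")])
```

Only the canonical per cluster is indexed for retrieval; the suppressed members can still be stored for "also reported by" attribution in the UI.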
Low-credibility sources pollute answers
Mitigation: Maintain a per-source credibility score (manual curation or third-party scoring like NewsGuard/Ad Fontes). Multiply retrieval score by source_credibility. For health, finance, or political topics, hard-filter to sources above a threshold. Surface credibility in the UI.
Breaking-news cascade fills the index with one story
Mitigation: Per-topic rate limiting at ingest: a topic cluster that spikes from 0 to 500 articles/hour gets capped (keep the first 50, drop the rest). Combined with dedup this prevents a single event from drowning the vector DB — and it protects the freshness signal for adjacent topics.
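A sliding-window cap per topic cluster is enough for this; a minimal sketch (the 50/hour cap follows the mitigation above, other values are illustrative):

```python
from collections import defaultdict, deque
from typing import Optional
import time

class TopicRateLimiter:
    """Cap how many articles per topic cluster enter the index per sliding hour."""

    def __init__(self, max_per_hour: int = 50, window_s: int = 3600):
        self.max = max_per_hour
        self.window = window_s
        self.admitted = defaultdict(deque)  # topic -> timestamps of admitted articles

    def admit(self, topic: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        q = self.admitted[topic]
        while q and now - q[0] > self.window:  # slide the window
            q.popleft()
        if len(q) >= self.max:
            return False  # cascade: drop, or divert to cold storage
        q.append(now)
        return True
```

Because the window slides per topic, a spiking story is capped without starving quieter topics of their freshness signal.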
Index grows unbounded and cost explodes
Mitigation: Time-partition the vector DB (Turbopuffer does this natively). Retain 30-90 days hot, move the rest to cold S3 with on-demand rehydration. Most news queries care about the last week; long-tail queries are rare enough that cold retrieval is acceptable.
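Turbopuffer partitions by time natively; for stores that don't, the routing policy is a one-liner. A sketch, assuming daily partitions and the 30-day hot window from above:

```python
from datetime import datetime, timedelta, timezone

HOT_RETENTION_DAYS = 30  # 30-90 days hot per the retention policy above

def partition_key(published_at: datetime) -> str:
    """Daily partitions, so eviction drops whole partitions instead of scanning rows."""
    return published_at.strftime("%Y-%m-%d")

def route_partition(published_at: datetime, now: datetime) -> str:
    """'hot' goes to the vector index; 'cold' goes to S3 with on-demand rehydration."""
    return "hot" if now - published_at <= timedelta(days=HOT_RETENTION_DAYS) else "cold"
```

A nightly job then only needs to move the single partition that crossed the 30-day boundary, rather than re-examining the whole index.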
Stale answers because ingestion is lagged
Mitigation: Track ingest-to-searchable latency as a first-class SLO (target < 60s p95). Expose it in the UI (‘answer based on articles from the last N minutes’). Alert on any lag above 3 minutes — that is the threshold at which users feel the tool is out of date.
Frequently asked questions
How fresh can RAG answers be?
With streaming ingest (Redpanda/Kafka), Trafilatura extraction, and a freshness-aware vector DB like Turbopuffer, ingest-to-searchable lands in 30-90 seconds p50. That means a news event published a minute ago is already retrievable. Past ~30 seconds, users generally cannot tell the difference, so 60s p95 is a good SLO.
Which vector DB is best for high-volume streaming ingest?
Turbopuffer is built for this workload — high ingest, time-partitioned, cheap cold storage. At 1M+ articles/day with 30-day retention it’s 5-10x cheaper than Pinecone. Pinecone Serverless is the safer managed default if cost isn’t yet dominant. Qdrant works but you own the scaling.
How do I deduplicate wire stories?
MinHash + LSH (via datasketch in Python) with a Jaccard similarity threshold around 0.85 catches >95% of wire republications. Runs in milliseconds at ingest. Cluster near-duplicates, promote the earliest-published canonical, suppress the rest from retrieval. SimHash is a cheaper alternative if you are memory-constrained.
Which LLM should I use for news answers?
Gemini 2.0 Flash and Claude Haiku 4 are the 2026 defaults — both stream fast (>100 tok/s first-token) and cost under a penny per answer. Groq-hosted Llama 3.3 runs at 500 tok/s if you are latency-obsessed. Reserve Sonnet 4 or GPT-4o for multi-story synthesis where model quality actually changes the answer.
How do I rank by freshness without losing relevance?
Don’t filter, decay. Compute `final_score = similarity × exp(-age_hours / tau)` where tau is topic-dependent (~2h for breaking news, ~168h for deep research). Multiply by source_credibility. A 30-minute-old article from a credible source beats a 3-day-old one; a seminal piece still surfaces when nothing fresher matches.
How do I handle source credibility?
Maintain a per-source score 0.0-1.0 (manual curation, NewsGuard, Ad Fontes, or a hybrid). Multiply retrieval score by credibility. For health/finance/political topics, hard-filter below a threshold (e.g. 0.7). Always surface the credibility score in the UI so users can judge the answer themselves.
Can I use this for financial news and market data?
Yes, and it’s one of the highest-ROI use cases. Couple the news pipeline with a structured market-data source (Polygon, Alpaca) so the answer can cite ‘stock moved X% after this headline at Y:ZZ’. Premium feeds (Bloomberg B-Pipe, Reuters) add latency-sensitive coverage but cost $5-10k/mo per seat.
How do I stop the index from growing forever?
Time-partition the vector DB. Keep 30-90 days hot in Turbopuffer/Pinecone, move the rest to cold S3 with on-demand rehydration. Most news queries are recent; long-tail historical queries can accept a 2-3 second cold retrieval penalty. Retention policy is a per-customer decision but 90 days is a sane default.