Reference Architecture · RAG
Slack + Notion Internal Search
Last updated: April 16, 2026
Quick answer
Run source-specific connectors that write to a Kafka topic, normalize into a shared chunk schema with ACL metadata (the user_ids or group_ids that can read each chunk), embed with Voyage-3-large, store in Qdrant with ACL arrays in the payload and a mandatory per-user filter, rerank with Voyage Rerank 2.5, and synthesize with Claude Sonnet 4. Expect $0.08 to $0.25 per query. The entire product is permissions: get them wrong and you leak salary discussions to interns. Webhooks are mandatory; polling lags by hours and misses deletions.
The problem
Your team’s knowledge is scattered across Slack threads, Notion pages, Linear tickets, GitHub PRs, and Google Docs. An engineer asks ‘what did we decide about the migration rollback plan’ and the answer lives in a Slack huddle summary from 3 weeks ago. You need Glean-style search that respects per-user permissions (a contractor can’t see the board channel), indexes incrementally via webhooks (no nightly full reindex), and cites the exact message or page, all in under 3 seconds.
Architecture
Source Connectors + Change Queue
Webhook listeners + backfill workers for Slack, Notion, Linear, Google Drive, and GitHub. Emit normalized create/update/delete events onto a Kafka/Redpanda topic that decouples ingest from indexing — consumers handle reindex, ACL update, and deletion propagation independently.
Alternatives: Airbyte, Ragie, Nuclia, Kafka, Glean connectors (paid)
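A minimal sketch of the normalized change event and the Kafka produce path. The topic name (`doc-changes`), the event fields, and the `emit` helper are illustrative assumptions of this document, not any connector SDK's API; keying by `source_id` preserves per-item event ordering.

```python
import json
import time
from dataclasses import dataclass, asdict
from confluent_kafka import Producer

@dataclass
class ChangeEvent:
    source: str           # "slack" | "notion" | "linear" | "gdrive" | "github"
    source_id: str        # stable per-item id, e.g. Slack "channel_id:thread_ts"
    op: str               # "create" | "update" | "delete"
    content: str | None   # None for deletes
    acl_users: list[str]
    acl_groups: list[str]
    updated_at: float

producer = Producer({"bootstrap.servers": "localhost:9092"})

def emit(event: ChangeEvent) -> None:
    # Key by source_id so create/update/delete events for one item land on
    # the same partition and stay ordered.
    producer.produce("doc-changes", key=event.source_id,
                     value=json.dumps(asdict(event)))
    producer.flush()

emit(ChangeEvent("slack", "C042:1712345678.000100", "update",
                 "thread text...", ["U1", "U2"], ["eng"], time.time()))
```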
ACL Resolver
For each ingested item, computes which user_ids + group_ids can read it. Slack channel membership, Notion page shares, Google Drive ACLs, GitHub repo access. Snapshots permission graph for fast filtering.
Alternatives: SpiceDB, Oso, Custom graph in Postgres
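A sketch of the Slack half of the resolver using the official slack_sdk client; the `workspace:all` group convention and the return shape are this document's assumptions, not Slack's API.

```python
from slack_sdk import WebClient

slack = WebClient(token="xoxb-...")

def resolve_slack_acl(channel_id: str) -> dict:
    channel = slack.conversations_info(channel=channel_id)["channel"]
    if not channel["is_private"]:
        # Public channel: readable by everyone in the workspace.
        return {"acl_users": [], "acl_groups": ["workspace:all"]}
    members, cursor = [], None
    while True:  # conversations.members is paginated
        page = slack.conversations_members(channel=channel_id,
                                           cursor=cursor, limit=200)
        members.extend(page["members"])
        cursor = page.get("response_metadata", {}).get("next_cursor")
        if not cursor:
            break
    return {"acl_users": members, "acl_groups": []}
```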
Source-aware Chunker
Slack: thread-level. Notion: heading-level blocks. Linear: ticket + comments. GitHub: PR description + reviews. Each source has its own semantic unit — never chunk Slack by 512 tokens.
Alternatives: Contextual retrieval, Fixed-size, Semantic chunking
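A sketch of the dispatch, assuming simplified input shapes (a Slack thread as a list of message dicts, a Notion page as a flat list of blocks with plain text already extracted):

```python
def chunk_slack_thread(messages: list[dict]) -> list[str]:
    # The whole thread is the semantic unit: one chunk per thread.
    return ["\n".join(f"{m['user']}: {m['text']}" for m in messages)]

def chunk_notion_page(blocks: list[dict]) -> list[str]:
    # One chunk per heading section: flush the buffer at each new heading.
    chunks, section = [], []
    for block in blocks:
        if block["type"].startswith("heading") and section:
            chunks.append("\n".join(section))
            section = []
        section.append(block["plain_text"])
    if section:
        chunks.append("\n".join(section))
    return chunks

CHUNKERS = {"slack": chunk_slack_thread, "notion": chunk_notion_page}
```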
Embedding Model
Voyage-3-large — best general embeddings, handles the mix of chat, docs, code, and tickets well.
Alternatives: OpenAI text-embedding-3-large, Cohere Embed v3
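A sketch with the voyageai Python client, which reads VOYAGE_API_KEY from the environment:

```python
import voyageai

vo = voyageai.Client()
result = vo.embed(
    ["what did we decide about the migration rollback plan?"],
    model="voyage-3-large",
    input_type="document",  # use input_type="query" at search time
)
vectors = result.embeddings  # one float vector per input text
```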
Vector DB (ACL-indexed)
Qdrant with ACL arrays as payload. A query filters `acl_users contains user_id OR acl_groups && user_groups` BEFORE similarity. Zero-trust: the LLM never sees content the user can’t read.
Alternatives: Weaviate, pgvector, Pinecone
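A sketch of indexing one chunk. The collection name, named vector, and payload fields follow this document's conventions; the sparse BM25 vector is omitted for brevity, and `vectors` comes from the embedding call above.

```python
from qdrant_client import QdrantClient, models

qdrant = QdrantClient(url="http://localhost:6333")

qdrant.upsert(
    collection_name="chunks",
    points=[models.PointStruct(
        id="5c9e8f1a-2f4b-4c7e-9d3a-1b2c3d4e5f60",  # derived from source_id
        vector={"dense": vectors[0]},               # sparse "bm25" vector omitted
        payload={
            "source": "slack",
            "source_id": "C042:1712345678.000100",
            "channel_id": "C042",
            "text": "thread text...",
            "acl_users": ["U1", "U2"],
            "acl_groups": ["eng"],
        },
    )],
)
# Keyword-index the ACL fields so the mandatory filter stays fast at scale.
for field in ("acl_users", "acl_groups", "source_id", "channel_id"):
    qdrant.create_payload_index("chunks", field,
                                field_schema=models.PayloadSchemaType.KEYWORD)
```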
Permissions-aware Retriever
Pulls the user’s group memberships from the ACL resolver, filters the vector DB, runs hybrid (dense + BM25), and returns top 40 candidates.
Alternatives: Post-filter (unsafe), Dense-only, Weighted hybrid
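A sketch of the filtered hybrid query, assuming named `dense` and `bm25` vectors in the collection, and that `user_id`, `user_groups`, and the two query vectors (`dense_query_vector`, plus a `models.SparseVector` for BM25) are computed upstream:

```python
from qdrant_client import models

# OR semantics: a direct user grant or any overlapping group grants access.
acl_filter = models.Filter(should=[
    models.FieldCondition(key="acl_users", match=models.MatchValue(value=user_id)),
    models.FieldCondition(key="acl_groups", match=models.MatchAny(any=user_groups)),
])

hits = qdrant.query_points(
    collection_name="chunks",
    prefetch=[
        models.Prefetch(query=dense_query_vector, using="dense",
                        filter=acl_filter, limit=100),
        models.Prefetch(query=sparse_query_vector, using="bm25",
                        filter=acl_filter, limit=100),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # fuse dense + BM25 ranks
    limit=40,  # candidate set handed to the reranker
).points
```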
Reranker
Voyage Rerank 2.5 picks top 6 from 40. Handles mixed-source candidates (a Slack thread, a Notion page, a Linear ticket) correctly.
Alternatives: Cohere Rerank v3, Jina Reranker v2
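A sketch with the voyageai rerank endpoint; the `rerank-2.5` model id follows the pick above, and `hits` is the candidate set from the retriever sketch.

```python
import voyageai

vo = voyageai.Client()
reranked = vo.rerank(
    query="what did we decide about the migration rollback plan?",
    documents=[h.payload["text"] for h in hits],
    model="rerank-2.5",
    top_k=6,
)
# Map reranked positions back to the full hits (payloads carry the citations).
top_chunks = [hits[r.index] for r in reranked.results]
```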
Answer Synthesizer
Claude Sonnet 4 — best at following citation format and disambiguating similar-looking threads. Answers with inline links to Slack/Notion/Linear/GitHub.
Alternatives: GPT-4o, Gemini 2.5 Pro
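A sketch of citation-constrained synthesis with the anthropic SDK; the system prompt and the `[source:source_id]` citation convention are this document's, not the API's, and `question` is the user's query.

```python
import anthropic

claude = anthropic.Anthropic()
context = "\n\n".join(
    f"[{c.payload['source']}:{c.payload['source_id']}]\n{c.payload['text']}"
    for c in top_chunks
)
msg = claude.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=("Answer from the context only. Cite every claim as "
            "[source:source_id]. If the context lacks the answer, say so."),
    messages=[{"role": "user",
               "content": f"{context}\n\nQuestion: {question}"}],
)
answer = msg.content[0].text
```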
Search UI
Slack slash command + web app + browser extension. Answer streams with clickable citations (message link, page link, ticket link). Deep links respect permissions and never leak protected URLs.
Alternatives: Slack bot only, Raycast extension, Chrome extension
The stack
Connectors: Webhooks give sub-minute freshness and, critically, deletion events. Polling misses deletes: a Notion page deleted at 10am will still answer queries at 2pm. For Slack, Notion, Linear, and GitHub, webhooks are mandatory. Use polling only for Google Drive, which still lacks a mature delete webhook.
Alternatives: Airbyte, Ragie, Glean connectors
Access control: ReBAC (relationship-based access) models the real permissions graph: user → member_of → group → can_read → page. Slack channels, Notion page shares, and GitHub org teams all map cleanly to ReBAC relations. Filter at retrieval time, never after the LLM has seen the content.
Alternatives: SpiceDB, Oso, Custom Postgres graph
Chunking: Slack thread-level, Notion heading-level, Linear ticket-level, GitHub PR-level. A 512-token fixed chunker on Slack would cut across turn boundaries, destroying context. Every source has a native semantic unit; use it.
Alternatives: Unified semantic chunker, Fixed-size
Embeddings: Voyage-3-large handles the heterogeneous mix (chat, docs, tickets, code, commit messages) better than single-domain embeddings. Top MTEB general-purpose model in 2026.
Alternatives: OpenAI text-embedding-3-large, Cohere Embed v3
Vector DB: Qdrant’s payload filtering is fast enough to enforce per-user ACL arrays at query time without degrading p95 latency, and it has native BM25 for hybrid search. Pinecone works, but metadata-filter performance on large arrays is worse.
Alternatives: Weaviate, Pinecone, pgvector
Reranker: Voyage Rerank 2.5 handles mixed-source candidate sets well. It knows that a Slack thread and a Notion page can both answer a question, and ranks by relevance, not source type. Roughly 200ms of added latency for a 30-40% relevance lift.
Alternatives: Cohere Rerank v3
Synthesizer: Claude Sonnet 4 is best at following a ‘cite source type + link’ format without inventing message IDs, and handles chat-style content (Slack) and structured content (Notion) in the same answer cleanly.
Alternatives: GPT-4o, Gemini 2.5 Pro
Cost at each scale
Prototype: 50-person team · 2k queries/mo · $260/mo
Startup: 500-person team · 40k queries/mo · $3,800/mo
Scale: 5000-person team · 800k queries/mo · $38,000/mo
Latency budget
The end-to-end target is under 3 seconds (see The problem); of that, reranking contributes roughly 200ms (see The stack).
Tradeoffs
Pre-filter ACL vs post-filter
Pre-filter is the only safe choice. Post-filter (let the LLM see everything, then hide forbidden citations) leaks: the model can still paraphrase content it saw, a failure mode seen in real incidents. Pre-filter in the vector DB, audit every retrieval with user_id, and accept the small p95 latency penalty.
Build connectors vs buy Glean/Guru/Stack AI
Glean is $15-30/user/month. At 500 users that is $10-15k/month, more than the self-built stack at the same scale. Building your own takes 3-6 engineer-months plus ongoing connector maintenance (Notion’s API changes yearly). Break-even is around 300 users if you have the eng capacity: under 200, buy; above 1000, build if search is differentiating.
Source-specific chunkers vs unified chunker
Source-specific adds code complexity but lifts answer quality 20-30%. A unified semantic chunker is simpler and still works — but Slack threads and Notion pages have genuinely different structure, and ignoring that costs you. Start unified, move to source-specific once you have evals that show the gap.
Failure modes & guardrails
Contractor sees protected salary discussion from #leadership
Mitigation: ACL filter runs BEFORE vector search, not after. Every chunk stores an `acl_users` array and `acl_groups` array. Query applies `acl_users contains user_id OR acl_groups && user_groups`. Log every retrieval with user_id for audit; run weekly automated red-team queries as a canary.
Permissions change but index is stale
Mitigation: Subscribe to Slack member_joined_channel / Notion page share events / GitHub team membership events. On each, enqueue an ACL-rewrite job for all affected chunks. Do not wait for the next content update. For Google Drive, poll ACLs every 15 min — its webhook story is still weak.
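A sketch of the ACL-rewrite job, reusing the resolver and Qdrant client sketched above; selecting points by payload filter assumes the `channel_id` field indexed earlier.

```python
from qdrant_client import models

def rewrite_channel_acl(channel_id: str) -> None:
    acl = resolve_slack_acl(channel_id)  # fresh membership snapshot
    # Overwrite acl_users/acl_groups on every chunk from this channel,
    # without touching vectors or other payload fields.
    qdrant.set_payload(
        collection_name="chunks",
        payload=acl,
        points=models.Filter(must=[models.FieldCondition(
            key="channel_id", match=models.MatchValue(value=channel_id),
        )]),
    )
```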
Deleted content is still retrievable
Mitigation: Every connector emits delete events. Consumer immediately removes vectors from Qdrant and marks rows as tombstoned in the metadata store. A nightly sweep reconciles against the source (list all Notion pages, diff, purge). Deletion SLO: under 5 minutes p95 for privacy, 1 minute p95 for legal-hold sources.
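A sketch of the consumer's delete path against the same Qdrant collection:

```python
from qdrant_client import models

def purge(source_id: str) -> None:
    # Remove every vector for the deleted item the moment the event lands.
    qdrant.delete(
        collection_name="chunks",
        points_selector=models.FilterSelector(filter=models.Filter(
            must=[models.FieldCondition(key="source_id",
                                        match=models.MatchValue(value=source_id))],
        )),
    )
    # Tombstone the row in the metadata store here so the nightly sweep
    # can verify the purge against the source.
```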
Incremental index misses Slack edits or reactions
Mitigation: Subscribe to message_changed, message_deleted, reaction_added/removed. Treat edits as upsert (same source_id, new content + updated_at). For reactions, store counts as metadata and boost retrieval score slightly — a thread with 20 thumbs-up is usually canonical.
LLM answers with a link to content it fabricated
Mitigation: Force the model to cite (source, source_id) pairs. Post-generation, validate every source_id appears in the retrieved context. Reject answers that cite unknown IDs. The UI only renders links it has verified — never pass-through a model-emitted URL without validation.
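A sketch of the validation step, assuming the `[source:source_id]` citation format from the synthesis prompt:

```python
import re

def citations_valid(answer: str, retrieved: list) -> bool:
    cited = set(re.findall(r"\[(\w+):([^\]]+)\]", answer))
    known = {(c.payload["source"], c.payload["source_id"]) for c in retrieved}
    return cited <= known  # any unknown (source, source_id) pair fails

if not citations_valid(answer, top_chunks):
    raise ValueError("answer cites a source_id that was never retrieved")
```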
Frequently asked questions
How is this different from Glean, Guru, or Stack AI?
Glean and Guru are polished managed products — $15-30/user/month all-in. This reference architecture is what they run internally, adapted for teams who want to build instead of buy. Break-even is around 300 users: below that, buy Glean; above 1000, build if search is differentiating. In between, depends on eng capacity.
How do I enforce Slack channel permissions?
For each Slack message, store the channel_id and channel type as metadata, then resolve the ACL from them. For public channels, everyone in the workspace can read. For private channels, call conversations.members and store the membership as acl_users. Subscribe to member_joined_channel and member_left_channel; on each event, update the ACL on every message from that channel. Never post-filter.
Polling vs webhooks for ingestion?
Webhooks always, where available. Slack, Notion, Linear, GitHub all have solid event webhooks with deletion events. Polling is 5-30 minutes stale and misses deletes — content deleted at 10am will still answer queries at 2pm, which is a privacy and legal exposure. Google Drive is the exception — its webhook is flaky, so poll ACLs every 15 min.
Which embedding model handles this mix best?
Voyage-3-large is the 2026 default — handles the heterogeneous mix of chat, docs, tickets, code, and commit messages better than specialized single-domain embeddings. OpenAI text-embedding-3-large and Cohere Embed v3 are viable alternatives and trail by only ~3-5%.
How do I keep the index cost down?
Three levers. (1) Skip bot messages and CI noise (40%+ of Slack volume). (2) Chunk at the source-native unit (thread, page, ticket) instead of fixed-size — fewer vectors. (3) Time-tier: the last 90 days hot in Qdrant, older in cold storage rehydrated on demand. Most internal queries hit the recent hot tier.
How fast is incremental indexing?
Webhook to searchable lands at 15-60 seconds p95 with a Kafka/Redpanda-backed pipeline. That is fast enough that ‘Ana posted this a minute ago’ queries just work. Full backfills happen once at install and rarely after — webhook-driven incremental is the steady state.
Can the LLM see content the user can’t?
Not if you pre-filter. Every chunk stores acl_users + acl_groups; retrieval filters on these before similarity search; the LLM only sees what the user can see. Post-filtering (the LLM sees everything, you hide forbidden citations) is unsafe — the model can still paraphrase what it saw. Pre-filter, always.
What’s the best answer LLM for this?
Claude Sonnet 4. It handles chat-style content (Slack) and structured content (Notion) in the same answer without voice drift, follows citation format reliably, and rarely invents message IDs. GPT-4o is close behind. Gemini 2.5 Pro helps when you need to reason across very large contexts (‘summarize every discussion about the Q3 launch’).