Reference Architecture · RAG
Slack + Notion Internal Search
Last updated: April 16, 2026
Quick answer
Run source-specific connectors that write to a Kafka topic, normalize into a shared chunk schema with ACL metadata (the user_ids or group_ids that can read each chunk), embed with Voyage-3-large, store in Qdrant with ACL arrays in the payload and a mandatory per-user filter, rerank with Voyage Rerank 2.5, and synthesize with Claude Sonnet 4. Expect $0.08 to $0.25 per query. The entire product is permissions: get them wrong and you leak salary discussions to interns. Webhooks are mandatory; polling lags by hours and misses deletions.
The problem
Your team’s knowledge is scattered across Slack threads, Notion pages, Linear tickets, GitHub PRs, and Google Docs. An engineer asks ‘what did we decide about the migration rollback plan’ and the answer lives in a Slack huddle summary from 3 weeks ago. You need Glean-style search that respects per-user permissions (a contractor can’t see the board channel), indexes incrementally via webhooks (no nightly full reindex), and cites the exact message or page, all in under 3 seconds.
Architecture
Source Connectors + Change Queue
Webhook listeners + backfill workers for Slack, Notion, Linear, Google Drive, and GitHub. Emit normalized create/update/delete events onto a Kafka/Redpanda topic that decouples ingest from indexing — consumers handle reindex, ACL update, and deletion propagation independently.
Alternatives: Airbyte, Ragie, Nuclia, Kafka, Glean connectors (paid)
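A minimal sketch of the normalized change event and the Kafka produce path. The topic name (`doc-changes`), the event fields, and the `emit` helper are illustrative assumptions of this document, not any connector SDK's API; keying by `source_id` preserves per-item event ordering.

```python
import json
import time
from dataclasses import dataclass, asdict
from confluent_kafka import Producer

@dataclass
class ChangeEvent:
    source: str           # "slack" | "notion" | "linear" | "gdrive" | "github"
    source_id: str        # stable per-item id, e.g. Slack "channel_id:thread_ts"
    op: str               # "create" | "update" | "delete"
    content: str | None   # None for deletes
    acl_users: list[str]
    acl_groups: list[str]
    updated_at: float

producer = Producer({"bootstrap.servers": "localhost:9092"})

def emit(event: ChangeEvent) -> None:
    # Key by source_id so create/update/delete events for one item land on
    # the same partition and stay ordered.
    producer.produce("doc-changes", key=event.source_id,
                     value=json.dumps(asdict(event)))
    producer.flush()

emit(ChangeEvent("slack", "C042:1712345678.000100", "update",
                 "thread text...", ["U1", "U2"], ["eng"], time.time()))
```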
ACL Resolver
For each ingested item, computes which user_ids + group_ids can read it. Slack channel membership, Notion page shares, Google Drive ACLs, GitHub repo access. Snapshots permission graph for fast filtering.
Alternatives: SpiceDB, Oso, Custom graph in Postgres
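A sketch of the Slack half of the resolver using the official slack_sdk client; the `workspace:all` group convention and the return shape are this document's assumptions, not Slack's API.

```python
from slack_sdk import WebClient

slack = WebClient(token="xoxb-...")

def resolve_slack_acl(channel_id: str) -> dict:
    channel = slack.conversations_info(channel=channel_id)["channel"]
    if not channel["is_private"]:
        # Public channel: readable by everyone in the workspace.
        return {"acl_users": [], "acl_groups": ["workspace:all"]}
    members, cursor = [], None
    while True:  # conversations.members is paginated
        page = slack.conversations_members(channel=channel_id,
                                           cursor=cursor, limit=200)
        members.extend(page["members"])
        cursor = page.get("response_metadata", {}).get("next_cursor")
        if not cursor:
            break
    return {"acl_users": members, "acl_groups": []}
```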
Source-aware Chunker
Slack: thread-level. Notion: heading-level blocks. Linear: ticket + comments. GitHub: PR description + reviews. Each source has its own semantic unit — never chunk Slack by 512 tokens.
Alternatives: Contextual retrieval, Fixed-size, Semantic chunking
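A sketch of the dispatch, assuming simplified input shapes (a Slack thread as a list of message dicts, a Notion page as a flat list of blocks with plain text already extracted):

```python
def chunk_slack_thread(messages: list[dict]) -> list[str]:
    # The whole thread is the semantic unit: one chunk per thread.
    return ["\n".join(f"{m['user']}: {m['text']}" for m in messages)]

def chunk_notion_page(blocks: list[dict]) -> list[str]:
    # One chunk per heading section: flush the buffer at each new heading.
    chunks, section = [], []
    for block in blocks:
        if block["type"].startswith("heading") and section:
            chunks.append("\n".join(section))
            section = []
        section.append(block["plain_text"])
    if section:
        chunks.append("\n".join(section))
    return chunks

CHUNKERS = {"slack": chunk_slack_thread, "notion": chunk_notion_page}
```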
Embedding Model
Voyage-3-large — best general embeddings, handles the mix of chat, docs, code, and tickets well.
Alternatives: OpenAI text-embedding-3-large, Cohere Embed v3
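A sketch with the voyageai Python client, which reads VOYAGE_API_KEY from the environment:

```python
import voyageai

vo = voyageai.Client()
result = vo.embed(
    ["what did we decide about the migration rollback plan?"],
    model="voyage-3-large",
    input_type="document",  # use input_type="query" at search time
)
vectors = result.embeddings  # one float vector per input text
```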
Vector DB (ACL-indexed)
Qdrant with ACL arrays as payload. A query filters `acl_users contains user_id OR acl_groups && user_groups` BEFORE similarity. Zero-trust: the LLM never sees content the user can’t read.
Alternatives: Weaviate, pgvector, Pinecone
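A sketch of indexing one chunk. The collection name, named vector, and payload fields follow this document's conventions; the sparse BM25 vector is omitted for brevity, and `vectors` comes from the embedding call above.

```python
from qdrant_client import QdrantClient, models

qdrant = QdrantClient(url="http://localhost:6333")

qdrant.upsert(
    collection_name="chunks",
    points=[models.PointStruct(
        id="5c9e8f1a-2f4b-4c7e-9d3a-1b2c3d4e5f60",  # derived from source_id
        vector={"dense": vectors[0]},               # sparse "bm25" vector omitted
        payload={
            "source": "slack",
            "source_id": "C042:1712345678.000100",
            "channel_id": "C042",
            "text": "thread text...",
            "acl_users": ["U1", "U2"],
            "acl_groups": ["eng"],
        },
    )],
)
# Keyword-index the ACL fields so the mandatory filter stays fast at scale.
for field in ("acl_users", "acl_groups", "source_id", "channel_id"):
    qdrant.create_payload_index("chunks", field,
                                field_schema=models.PayloadSchemaType.KEYWORD)
```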
Permissions-aware Retriever
Pulls the user’s group memberships from the ACL resolver, filters the vector DB, runs hybrid (dense + BM25), and returns top 40 candidates.
Alternatives: Post-filter (unsafe), Dense-only, Weighted hybrid
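A sketch of the filtered hybrid query, assuming named `dense` and `bm25` vectors in the collection, and that `user_id`, `user_groups`, and the two query vectors (`dense_query_vector`, plus a `models.SparseVector` for BM25) are computed upstream:

```python
from qdrant_client import models

# OR semantics: a direct user grant or any overlapping group grants access.
acl_filter = models.Filter(should=[
    models.FieldCondition(key="acl_users", match=models.MatchValue(value=user_id)),
    models.FieldCondition(key="acl_groups", match=models.MatchAny(any=user_groups)),
])

hits = qdrant.query_points(
    collection_name="chunks",
    prefetch=[
        models.Prefetch(query=dense_query_vector, using="dense",
                        filter=acl_filter, limit=100),
        models.Prefetch(query=sparse_query_vector, using="bm25",
                        filter=acl_filter, limit=100),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # fuse dense + BM25 ranks
    limit=40,  # candidate set handed to the reranker
).points
```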
Reranker
Voyage Rerank 2.5 picks top 6 from 40. Handles mixed-source candidates (a Slack thread, a Notion page, a Linear ticket) correctly.
Alternatives: Cohere Rerank v3, Jina Reranker v2
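A sketch with the voyageai rerank endpoint; the `rerank-2.5` model id follows the pick above, and `hits` is the candidate set from the retriever sketch.

```python
import voyageai

vo = voyageai.Client()
reranked = vo.rerank(
    query="what did we decide about the migration rollback plan?",
    documents=[h.payload["text"] for h in hits],
    model="rerank-2.5",
    top_k=6,
)
# Map reranked positions back to the full hits (payloads carry the citations).
top_chunks = [hits[r.index] for r in reranked.results]
```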
Answer Synthesizer
Claude Sonnet 4 — best at following citation format and disambiguating similar-looking threads. Answers with inline links to Slack/Notion/Linear/GitHub.
Alternatives: GPT-4o, Gemini 2.5 Pro
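A sketch of citation-constrained synthesis with the anthropic SDK; the system prompt and the `[source:source_id]` citation convention are this document's, not the API's, and `question` is the user's query.

```python
import anthropic

claude = anthropic.Anthropic()
context = "\n\n".join(
    f"[{c.payload['source']}:{c.payload['source_id']}]\n{c.payload['text']}"
    for c in top_chunks
)
msg = claude.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=("Answer from the context only. Cite every claim as "
            "[source:source_id]. If the context lacks the answer, say so."),
    messages=[{"role": "user",
               "content": f"{context}\n\nQuestion: {question}"}],
)
answer = msg.content[0].text
```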
Search UI
Slack slash command + web app + browser extension. Answer streams with clickable citations (message link, page link, ticket link). Deep links respect permissions and never leak protected URLs.
Alternatives: Slack bot only, Raycast extension, Chrome extension
The stack
Connectors: Webhooks give sub-minute freshness and, critically, deletion events. Polling misses deletes: a Notion page deleted at 10am will still answer queries at 2pm. For Slack, Notion, Linear, and GitHub, webhooks are mandatory. Use polling only for Google Drive, which still lacks a mature delete webhook.
Alternatives: Airbyte, Ragie, Glean connectors
Access control: ReBAC (relationship-based access) models the real permissions graph: user → member_of → group → can_read → page. Slack channels, Notion page shares, and GitHub org teams all map cleanly to ReBAC relations. Filter at retrieval time, never after the LLM has seen the content.
Alternatives: SpiceDB, Oso, Custom Postgres graph
Chunking: Slack thread-level, Notion heading-level, Linear ticket-level, GitHub PR-level. A 512-token fixed chunker on Slack would cut across turn boundaries, destroying context. Every source has a native semantic unit; use it.
Alternatives: Unified semantic chunker, Fixed-size
Embeddings: Voyage-3-large handles the heterogeneous mix (chat, docs, tickets, code, commit messages) better than single-domain embeddings. Top MTEB general-purpose model in 2026.
Alternatives: OpenAI text-embedding-3-large, Cohere Embed v3
Vector DB: Qdrant’s payload filtering is fast enough to enforce per-user ACL arrays at query time without degrading p95 latency, and it has native BM25 for hybrid search. Pinecone works, but metadata-filter performance on large arrays is worse.
Alternatives: Weaviate, Pinecone, pgvector
Reranker: Voyage Rerank 2.5 handles mixed-source candidate sets well. It knows that a Slack thread and a Notion page can both answer a question, and ranks by relevance, not source type. Roughly 200ms of added latency for a 30-40% relevance lift.
Alternatives: Cohere Rerank v3
Synthesizer: Claude Sonnet 4 is best at following a ‘cite source type + link’ format without inventing message IDs, and handles chat-style content (Slack) and structured content (Notion) in the same answer cleanly.
Alternatives: GPT-4o, Gemini 2.5 Pro
Cost at each scale
Prototype: 50-person team · 2k queries/mo · $260/mo
Startup: 500-person team · 40k queries/mo · $3,800/mo
Scale: 5000-person team · 800k queries/mo · $38,000/mo
Latency budget
The end-to-end target is under 3 seconds (see The problem); of that, reranking contributes roughly 200ms (see The stack).
Tradeoffs
Pre-filter ACL vs post-filter
Pre-filter is the only safe choice. Post-filter (let the LLM see everything, then hide forbidden citations) leaks: the model can still paraphrase content it saw, a failure mode seen in real incidents. Pre-filter in the vector DB, audit every retrieval with user_id, and accept the small p95 latency penalty.
Build connectors vs buy Glean/Guru/Stack AI
Glean is $15-30/user/month. At 500 users that is $10-15k/month, more than the self-built stack at the same scale. Building your own takes 3-6 engineer-months plus ongoing connector maintenance (Notion’s API changes yearly). Break-even is around 300 users if you have the eng capacity: under 200, buy; above 1000, build if search is differentiating.
Source-specific chunkers vs unified chunker
Source-specific adds code complexity but lifts answer quality 20-30%. A unified semantic chunker is simpler and still works — but Slack threads and Notion pages have genuinely different structure, and ignoring that costs you. Start unified, move to source-specific once you have evals that show the gap.
Failure modes & guardrails
Contractor sees protected salary discussion from #leadership
Mitigation: ACL filter runs BEFORE vector search, not after. Every chunk stores an `acl_users` array and `acl_groups` array. Query applies `acl_users contains user_id OR acl_groups && user_groups`. Log every retrieval with user_id for audit; run weekly automated red-team queries as a canary.
Permissions change but index is stale
Mitigation: Subscribe to Slack member_joined_channel / Notion page share events / GitHub team membership events. On each, enqueue an ACL-rewrite job for all affected chunks. Do not wait for the next content update. For Google Drive, poll ACLs every 15 min — its webhook story is still weak.
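A sketch of the ACL-rewrite job, reusing the resolver and Qdrant client sketched above; selecting points by payload filter assumes the `channel_id` field indexed earlier.

```python
from qdrant_client import models

def rewrite_channel_acl(channel_id: str) -> None:
    acl = resolve_slack_acl(channel_id)  # fresh membership snapshot
    # Overwrite acl_users/acl_groups on every chunk from this channel,
    # without touching vectors or other payload fields.
    qdrant.set_payload(
        collection_name="chunks",
        payload=acl,
        points=models.Filter(must=[models.FieldCondition(
            key="channel_id", match=models.MatchValue(value=channel_id),
        )]),
    )
```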
Deleted content is still retrievable
Mitigation: Every connector emits delete events. Consumer immediately removes vectors from Qdrant and marks rows as tombstoned in the metadata store. A nightly sweep reconciles against the source (list all Notion pages, diff, purge). Deletion SLO: under 5 minutes p95 for privacy, 1 minute p95 for legal-hold sources.
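A sketch of the consumer's delete path against the same Qdrant collection:

```python
from qdrant_client import models

def purge(source_id: str) -> None:
    # Remove every vector for the deleted item the moment the event lands.
    qdrant.delete(
        collection_name="chunks",
        points_selector=models.FilterSelector(filter=models.Filter(
            must=[models.FieldCondition(key="source_id",
                                        match=models.MatchValue(value=source_id))],
        )),
    )
    # Tombstone the row in the metadata store here so the nightly sweep
    # can verify the purge against the source.
```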
Incremental index misses Slack edits or reactions
Mitigation: Subscribe to message_changed, message_deleted, reaction_added/removed. Treat edits as upsert (same source_id, new content + updated_at). For reactions, store counts as metadata and boost retrieval score slightly — a thread with 20 thumbs-up is usually canonical.
LLM answers with a link to content it fabricated
Mitigation: Force the model to cite (source, source_id) pairs. Post-generation, validate every source_id appears in the retrieved context. Reject answers that cite unknown IDs. The UI only renders links it has verified — never pass-through a model-emitted URL without validation.
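A sketch of the validation step, assuming the `[source:source_id]` citation format from the synthesis prompt:

```python
import re

def citations_valid(answer: str, retrieved: list) -> bool:
    cited = set(re.findall(r"\[(\w+):([^\]]+)\]", answer))
    known = {(c.payload["source"], c.payload["source_id"]) for c in retrieved}
    return cited <= known  # any unknown (source, source_id) pair fails

if not citations_valid(answer, top_chunks):
    raise ValueError("answer cites a source_id that was never retrieved")
```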
Frequently asked questions
How is this different from Glean, Guru, or Stack AI?
Glean and Guru are polished managed products — $15-30/user/month all-in. This reference architecture is what they run internally, adapted for teams who want to build instead of buy. Break-even is around 300 users: below that, buy Glean; above 1000, build if search is differentiating. In between, depends on eng capacity.
How do I enforce Slack channel permissions?
For each Slack message, store the channel_id and channel type as metadata, then resolve the ACL from them. For public channels, everyone in the workspace can read. For private channels, call conversations.members and store the membership as acl_users. Subscribe to member_joined_channel and member_left_channel; on each event, update the ACL on every message from that channel. Never post-filter.
Polling vs webhooks for ingestion?
Webhooks always, where available. Slack, Notion, Linear, GitHub all have solid event webhooks with deletion events. Polling is 5-30 minutes stale and misses deletes — content deleted at 10am will still answer queries at 2pm, which is a privacy and legal exposure. Google Drive is the exception — its webhook is flaky, so poll ACLs every 15 min.
Which embedding model handles this mix best?
Voyage-3-large is the 2026 default — handles the heterogeneous mix of chat, docs, tickets, code, and commit messages better than specialized single-domain embeddings. OpenAI text-embedding-3-large and Cohere Embed v3 are viable alternatives and trail by only ~3-5%.
How do I keep the index cost down?
Three levers. (1) Skip bot messages and CI noise (40%+ of Slack volume). (2) Chunk at the source-native unit (thread, page, ticket) instead of fixed-size — fewer vectors. (3) Time-tier: the last 90 days hot in Qdrant, older in cold storage rehydrated on demand. Most internal queries hit the recent hot tier.
How fast is incremental indexing?
Webhook to searchable lands at 15-60 seconds p95 with a Kafka/Redpanda-backed pipeline. That is fast enough that ‘Ana posted this a minute ago’ queries just work. Full backfills happen once at install and rarely after — webhook-driven incremental is the steady state.
Can the LLM see content the user can’t?
Not if you pre-filter. Every chunk stores acl_users + acl_groups; retrieval filters on these before similarity search; the LLM only sees what the user can see. Post-filtering (the LLM sees everything, you hide forbidden citations) is unsafe — the model can still paraphrase what it saw. Pre-filter, always.
What’s the best answer LLM for this?
Claude Sonnet 4. It handles chat-style content (Slack) and structured content (Notion) in the same answer without voice drift, follows citation format reliably, and rarely invents message IDs. GPT-4o is close behind. Gemini 2.5 Pro helps when you need to reason across very large contexts (‘summarize every discussion about the Q3 launch’).