Reference Architecture · agent

Meeting Notetaker Agent

Last updated: April 16, 2026

Quick answer

The production stack uses Deepgram Nova-3 or Whisper v3 for transcription with speaker diarization, Claude Sonnet 4 for structured summary and action item extraction, and Haiku 4 for incremental live summarization during the call. Expect $0.10–$0.35 per hour of audio end-to-end, with 99.1%+ transcript accuracy (sub-1% WER) on US English and 96–98% on accented speech. Action items ship with confidence scores and owner attribution.

The problem

Teams waste hours on meeting notes — or worse, ship decisions without notes at all. You need an agent that joins a call (or ingests a recording), produces a high-quality transcript with speaker labels, extracts concrete action items with owners, and writes a tight summary that's useful a week later. Hard parts: accent robustness, speaker diarization, extracting decisions (not just topics), and not hallucinating action items that no one committed to.

Architecture

Pipeline: Audio Capture (input) → Speech-to-Text + Diarization (infra) → Live Summarizer (LLM) → Speaker Identity Resolver (infra) → Post-Call Summarizer (LLM) → Action Item Extractor (LLM) → Transcript + Summary Storage (data) → Summary Delivery (output). Audio streams into the live path during the call; the post-call stages run on meeting end.

Audio Capture

Bot joins Zoom/Meet/Teams via native API or browser automation. Records 16kHz mono per-speaker where possible.

Alternatives: Recall.ai (meeting bot), Zoom RTMS, Upload a file

Speech-to-Text + Diarization

Streaming transcript with speaker labels. Nova-3 gives sub-300ms latency and built-in diarization.

Alternatives: AssemblyAI Universal-2, Whisper v3 + pyannote, AWS Transcribe

Live Summarizer

Runs every ~60s on rolling 5-min transcript window. Produces 'where we are' summary for late joiners and a running agenda.

Alternatives: GPT-4o-mini, Gemini 2.0 Flash
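The windowing step above can be sketched in a few lines. This is a minimal illustration of selecting the rolling 5-minute slice that gets sent to the small model every ~60s; the actual summarization call is omitted, and the turn format `(start_sec, speaker, text)` is an assumption for this sketch.

```python
def rolling_window(turns, now, window_sec=300):
    """Return the transcript text for the last `window_sec` seconds.

    turns: list of (start_sec, speaker, text), in chronological order.
    Sketch of the windowing step only; the Haiku call itself is omitted.
    """
    cutoff = now - window_sec
    recent = [t for t in turns if t[0] >= cutoff]
    return "\n".join(f"{speaker}: {text}" for _, speaker, text in recent)
```

A scheduler would call this once a minute with the current call clock and feed the result, plus the previous running summary, to the live model.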

Speaker Identity Resolver

Maps anonymous speaker labels (Speaker 0, 1, 2) to real names via meeting invite attendees, voice profile, and introduction parsing.

Alternatives: Google Speaker ID, Pyannote embedding, Manual post-meeting
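The introduction-parsing leg of the resolver can be sketched as a regex pass over each speaker's early turns, cross-checked against the invite list. This is a hypothetical minimal version; a real resolver would combine it with voice embeddings and handle multi-word and non-Western names, which this phrase list and regex do not.

```python
import re

# Self-introduction phrases; a starting point, not exhaustive.
INTRO = re.compile(r"\b(?:i'm|i am|this is|my name is)\s+([A-Z][a-z]+)",
                   re.IGNORECASE)

def resolve_speakers(turns, attendees):
    """turns: list of (speaker_label, text); attendees: names from the invite.

    Maps diarization labels to invite names only when an introduction
    matches an attendee's first name, to avoid inventing identities.
    """
    mapping = {}
    by_first = {name.lower().split()[0]: name for name in attendees}
    for label, text in turns:
        if label in mapping:
            continue
        m = INTRO.search(text)
        if m and m.group(1).lower() in by_first:
            mapping[label] = by_first[m.group(1).lower()]
    return mapping
```

Unresolved labels stay anonymous and can be flagged for manual post-meeting cleanup.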

Post-Call Summarizer

Full transcript → structured output: TLDR, decisions, action items (with owner + due date), open questions, quoted highlights.

Alternatives: GPT-4o, Gemini 2.5 Pro
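The structured output can be pinned down as a schema the summarizer must fill. The field names below are illustrative, not a published spec; the point is that every action item carries owner, evidence quote, and confidence, matching the sections listed above.

```python
from typing import List, Optional, TypedDict

class ActionItem(TypedDict):
    owner: str
    task: str
    due_date: Optional[str]   # ISO date if one was stated, else None
    quote: str                # verbatim transcript evidence
    confidence: float         # 0..1 from the extraction pass

class MeetingSummary(TypedDict):
    tldr: str
    decisions: List[str]
    action_items: List[ActionItem]
    open_questions: List[str]
    highlights: List[str]     # short quoted moments
```

Validating the model's JSON against a schema like this catches malformed output before it reaches delivery.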

Action Item Extractor

Second pass specifically on action items — requires explicit commitment language from the transcript, outputs confidence score.

Alternatives: GPT-4o, Fine-tuned Haiku

Transcript + Summary Storage

Stores transcript (searchable), summaries, and embeddings for later semantic search across meetings.

Alternatives: Supabase + pgvector, S3 + Typesense

Summary Delivery

Posts summary to Slack/email, pushes action items to Linear/Asana/Jira, creates follow-up tasks in CRM.

Alternatives: Slack bot, Linear API, Notion database

The stack

STT · Deepgram Nova-3

Nova-3 leads on streaming latency (sub-300ms), accented-English WER, and built-in diarization quality. AssemblyAI is competitive. Self-hosted Whisper is viable at scale but requires GPU ops.

Alternatives: AssemblyAI Universal-2, Whisper v3 Large, AWS Transcribe

Summarizer LLM · Claude Sonnet 4

Sonnet 4 produces cleaner structured output and fabricates action items much less often than GPT-4o on meeting transcripts — verified on real-world eval sets.

Alternatives: GPT-4o, Gemini 2.5 Pro

Live summarizer · Claude Haiku 4

Runs every 60s on rolling window. Speed and cost matter more than peak quality.

Alternatives: GPT-4o-mini, Gemini 2.0 Flash

Meeting bot · Recall.ai

Recall.ai abstracts Zoom/Meet/Teams/Webex bot joining with one API. Building this yourself takes months of platform-specific edge cases.

Alternatives: Fireflies API, Custom via Zoom SDK

Storage + search · Postgres + pgvector

Meeting search is the killer feature at scale ('what did we decide about pricing last month'). Embed each meeting's summary + quotable moments, not every turn.

Alternatives: Supabase, Turbopuffer
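Search over per-meeting summary embeddings can be sketched as a pgvector cosine-distance query. Table and column names here are assumptions for illustration; the helper just builds the SQL and parameters, so the snippet stays database-free.

```python
def pgvector_search_sql(query_embedding, limit=5):
    """Build a pgvector cosine-distance query over per-meeting summary
    embeddings. Table/column names (meeting_summaries, embedding) are
    illustrative.
    """
    # pgvector accepts vectors as '[x1,x2,...]' text literals.
    vec = "[" + ",".join(f"{x:.6f}" for x in query_embedding) + "]"
    sql = (
        "SELECT meeting_id, summary, 1 - (embedding <=> %s) AS score "
        "FROM meeting_summaries ORDER BY embedding <=> %s LIMIT %s"
    )
    return sql, (vec, vec, limit)
```

`<=>` is pgvector's cosine-distance operator; ordering by it ascending returns the closest meetings first, which matches the "what did we decide about pricing" use case.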

Integrations · Linear + Slack native

Action items are useless unless they land where the owner already works. Support 2–3 destinations deeply rather than 10 shallowly.

Alternatives: Asana, Jira, Notion

Evals · Braintrust + human review

You must eval against real meeting transcripts with known action items — synthetic data misses the 'X said they'd think about it' ambiguity.

Alternatives: Langfuse, Humanloop

Cost at each scale

Prototype

100 meetings/mo, ~80 hrs audio

$75/mo

Deepgram Nova-3 (80 hrs) · $35
Sonnet 4 summaries · $18
Haiku 4 live summarizer · $4
Recall.ai starter · $15
Supabase + infra · $3

Startup

5,000 meetings/mo, ~4,000 hrs audio

$3,900/mo

Deepgram Nova-3 (4k hrs) · $1,700
Sonnet 4 summaries (cached) · $750
Haiku 4 live summarizer · $180
Recall.ai growth · $800
Supabase Pro + pgvector · $180
Infra + observability · $290

Scale

100,000 meetings/mo, ~80,000 hrs audio

$62,000/mo

Deepgram Enterprise (80k hrs) · $28,000
Sonnet 4 summaries (heavy caching) · $14,000
Haiku 4 live summarizer · $3,200
Recall.ai Enterprise · $9,000
Postgres + vector infra · $4,500
Compute + observability · $3,300

Latency budget

Stage · Median · P95
STT streaming lag (live) · 280ms · 650ms
Live summary (every 60s) · 900ms · 1,800ms
Speaker ID resolution · 400ms · 1,200ms
Full-meeting Sonnet summary (60-min call) · 11,000ms · 22,000ms
Action item extraction pass · 4,500ms · 9,000ms
Delivery to Slack/Linear · 800ms · 2,500ms
Total · 17,880ms · 37,150ms

Tradeoffs

Streaming vs batch STT

Streaming enables live captions and live summaries but costs 2–3x batch. For most workflows (post-meeting summary), batch after the call is fine and cheaper. Use streaming only when you're actively surfacing context during the call.

Deepgram vs self-hosted Whisper

At <10k hrs/month, Deepgram is cheaper and better. Above 50k hrs/month, self-hosted Whisper v3 on L40S or H100 GPUs hits lower unit cost but requires ops, GPU capacity planning, and gives up some WER on accents without fine-tuning.

One-shot summary vs extract-then-summarize

Feeding the full transcript and asking for everything at once is simpler but more lossy. Two-pass (first extract decisions + action items as bullets, then summarize) yields measurably better action item recall and fewer hallucinations, at ~1.5x cost.
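The two-pass flow can be sketched as a small orchestration function. The LLM call is injected as a plain callable so the sketch stays model-agnostic; the prompts are illustrative, not the production prompts.

```python
def summarize_two_pass(transcript, call_llm):
    """Two-pass extract-then-summarize sketch.

    call_llm(prompt) -> str is injected (e.g. a Sonnet 4 wrapper).
    Pass 1 extracts decisions/action items with evidence; pass 2 writes
    the summary constrained to those extractions, which is what curbs
    hallucinated action items.
    """
    extracted = call_llm(
        "List every decision and action item as bullets, each with a "
        "verbatim supporting quote:\n\n" + transcript
    )
    summary = call_llm(
        "Write a TLDR and structured summary. Use ONLY these extracted "
        "bullets as the source of decisions and action items:\n\n"
        + extracted + "\n\nTranscript for context:\n\n" + transcript
    )
    return extracted, summary
```

Because pass 2 sees the extractions explicitly, recall failures show up as missing bullets in pass 1 output, which is easy to eval in isolation.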

Failure modes & guardrails

Hallucinated action items no one actually committed to

Mitigation: Require each action item to include a verbatim quote from the transcript as source. Reject extractions where the quote doesn't contain explicit commitment language ('I'll', 'we should', 'let me', 'by Friday'). Run a secondary Sonnet pass as LLM-as-judge.
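The grounding check itself is mechanical and worth doing in code before the LLM-as-judge pass. A minimal sketch, assuming the extractor returns a quote per item; the phrase list is a starting point, not exhaustive.

```python
# Explicit commitment phrases; extend from real transcripts over time.
COMMITMENT = ("i'll", "i will", "we'll", "we will", "let me",
              "i can take", "we should", "by friday", "by end of")

def grounded(item_quote, transcript):
    """Accept an action item only if its quote appears verbatim in the
    transcript AND the quote contains explicit commitment language.
    Items failing either check are rejected before the judge pass.
    """
    q = item_quote.lower()
    return q in transcript.lower() and any(p in q for p in COMMITMENT)
```

This filter costs nothing per item and catches the most common fabrication mode: a plausible task with no supporting utterance.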

Wrong speaker attribution on crosstalk

Mitigation: Use Deepgram diarization with per-speaker channels where the platform provides them (Zoom does, Google Meet partially). For single-channel audio, voice-print match against attendees' past meetings; flag low-confidence attributions in the summary.

Summary misses the one key decision

Mitigation: Add a dedicated 'decisions' extraction pass with a tight rubric (a decision = statement + agreement from >1 speaker). Run against a golden eval set of 200 meetings with human-labeled decisions before any prompt change ships.

Privacy breach — summary delivered to wrong channel

Mitigation: Never auto-deliver across org boundaries — internal summaries never DM external attendees. Per-meeting ACL derived from calendar invite. Default to private (invitee-only) delivery; user explicitly opts in to broader channels.

Audio with thick accent or technical jargon drops WER

Mitigation: Enable Deepgram's custom vocabulary with product/person/company names from the meeting invite and past transcripts. Prompt the summarizer with 'likely typos: X→Y' hints derived from STT confidence scores below 0.7.
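Assembling the vocabulary-biased STT request can be sketched as below. The parameter names follow Deepgram's keyterm prompting for Nova-3, but verify them against the current Deepgram API docs before relying on this; the helper itself just builds an options dict.

```python
def stt_request_params(invite_names, product_terms):
    """Build STT request options biasing recognition toward known names
    and jargon. Parameter names (model, diarize, smart_format, keyterm)
    mirror Deepgram's API but should be checked against current docs.
    """
    return {
        "model": "nova-3",
        "diarize": True,
        "smart_format": True,
        # De-duplicated, stable-ordered vocabulary from invite + history.
        "keyterm": sorted(set(invite_names) | set(product_terms)),
    }
```

Feeding the same term list into the summarizer prompt keeps spellings consistent between transcript and summary.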

Frequently asked questions

Which STT service is best for meetings in 2026?

Deepgram Nova-3 and AssemblyAI Universal-2 are the two real choices. Nova-3 edges ahead on streaming latency and multilingual accented English. Whisper v3 Large is competitive and cheaper at scale, but diarization requires a separate model (pyannote).

Why Claude Sonnet 4 over GPT-4o for summaries?

On meeting-specific evals, Sonnet 4 fabricates action items less frequently and produces cleaner structured JSON on long inputs. GPT-4o is close on general summary quality but has higher variance on action item accuracy.

How do you handle 2-hour meetings that exceed context windows?

Two-pass: chunk into 20-min windows, summarize each with Haiku 4, then feed chunk summaries + raw action item extractions into Sonnet 4 for the final structured output. Don't try to fit 2 hours verbatim into one call even with 1M-context models — quality drops on the middle chunks.
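The chunking step can be sketched by bucketing timestamped turns into fixed 20-minute windows; the turn format `(start_sec, speaker, text)` is an assumption for this sketch, and the per-chunk Haiku calls are omitted.

```python
from itertools import groupby

def chunk_transcript(turns, chunk_sec=1200):
    """Split timestamped turns into ~20-min chunks for per-chunk summaries.

    turns: list of (start_sec, speaker, text), sorted by time. Each turn
    lands in the bucket containing its start time.
    """
    return [list(group)
            for _, group in groupby(turns, key=lambda t: t[0] // chunk_sec)]
```

Chunk boundaries can cut a topic mid-discussion; overlapping windows or boundary snapping to silence gaps are common refinements.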

Can it replace Fireflies or Otter?

Yes if you're willing to own the integration layer — the LLM-plus-STT stack described here is what those tools are. Fireflies/Otter give you the product polish (mobile apps, search UI, SSO) for a premium. Build for internal use or when you need tight integration with your existing tools.

How accurate are the action items?

With the two-pass extract-then-verify pattern, we see ~85–90% precision and ~75–85% recall on real meeting data. The extract-only, no-quote-grounding pattern drops precision to 60–70% — do not skip the grounding requirement.

Does it work for non-English meetings?

Deepgram Nova-3 and AssemblyAI cover 30+ languages with varying quality. For non-English meetings, summarize in the original language then optionally translate — translating the transcript first introduces meaning drift that the summary inherits.

How do you deal with compliance (HIPAA, GDPR)?

Use Deepgram's HIPAA-eligible tier and Anthropic's zero-data-retention endpoints. Offer per-org data residency (EU-only, US-only) via Bedrock. Retention policy: delete audio within 7 days, make transcript retention user-configurable, and for HIPAA tenants default to storing summaries only, with no raw transcript.
