Reference Architecture · agent
Meeting Notetaker Agent
Last updated: April 16, 2026
Quick answer
The production stack uses Deepgram Nova-3 or Whisper v3 for transcription with speaker diarization, Claude Sonnet 4 for structured summary + action item extraction, and Haiku 4 for incremental live summarization during the call. Expect $0.10–$0.35 per hour of audio end-to-end, 99.1%+ transcript accuracy (word level) on US English and 96–98% on accented speech. Action items ship with confidence scores and owner attribution.
The problem
Teams waste hours on meeting notes — or worse, ship decisions without notes at all. You need an agent that joins a call (or ingests a recording), produces a high-quality transcript with speaker labels, extracts concrete action items with owners, and writes a tight summary that's useful a week later. Hard parts: accent robustness, speaker diarization, extracting decisions (not just topics), and not hallucinating action items that no one committed to.
Architecture
Audio Capture
Bot joins Zoom/Meet/Teams via native API or browser automation. Records 16kHz mono per-speaker where possible.
Alternatives: Recall.ai (meeting bot), Zoom RTMS, file upload
Speech-to-Text + Diarization
Streaming transcript with speaker labels. Nova-3 gives sub-300ms latency and built-in diarization.
Alternatives: AssemblyAI Universal-2, Whisper v3 + pyannote, AWS Transcribe
Live Summarizer
Runs every ~60s on rolling 5-min transcript window. Produces 'where we are' summary for late joiners and a running agenda.
Alternatives: GPT-4o-mini, Gemini 2.0 Flash
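A minimal sketch of the rolling-window selection that feeds the live summarizer every ~60s — the `Segment` shape and the prompt formatting are illustrative, not a fixed interface; the actual model call is omitted:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # seconds from meeting start
    speaker: str   # diarization label or resolved name
    text: str

def rolling_window(segments, now, window_s=300):
    """Return the transcript slice covering the last window_s seconds."""
    return [s for s in segments if s.start >= now - window_s]

def window_prompt(segments):
    """Format a window as 'Speaker: text' lines for the summarizer model."""
    return "\n".join(f"{s.speaker}: {s.text}" for s in segments)

# At t=400s, only the last two segments fall inside the 5-minute window.
segs = [Segment(30, "S0", "Kicking off."),
        Segment(150, "S1", "Pricing first."),
        Segment(390, "S0", "Agreed, ship Friday.")]
win = rolling_window(segs, now=400)
```

The summarizer prompt would carry the previous running summary plus `window_prompt(win)`, so late joiners get continuity rather than an isolated 5-minute snapshot.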
Speaker Identity Resolver
Maps anonymous speaker labels (Speaker 0, 1, 2) to real names via meeting invite attendees, voice profile, and introduction parsing.
Alternatives: Google Speaker ID, Pyannote embedding, Manual post-meeting
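One of the three signals — introduction parsing — can be sketched in a few lines. The regex, the turn format, and all names here are hypothetical; a production resolver would combine this with voice profiles and invite metadata:

```python
import re

# Matches self-introductions like "I'm Priya" or "this is Marco".
INTRO = re.compile(r"\b(?:I'm|I am|this is)\s+([A-Z][a-z]+)", re.IGNORECASE)

def resolve_speakers(turns, attendees):
    """Map diarization labels to attendee names via self-introductions.

    turns: list of (label, text) pairs; attendees: invite display names."""
    first_names = {a.split()[0].lower(): a for a in attendees}
    mapping = {}
    for label, text in turns:
        if label in mapping:
            continue
        m = INTRO.search(text)
        if m and m.group(1).lower() in first_names:
            mapping[label] = first_names[m.group(1).lower()]
    return mapping

turns = [("Speaker 0", "Hi all, I'm Priya, thanks for joining."),
         ("Speaker 1", "This is Marco from finance.")]
mapping = resolve_speakers(turns, ["Priya Shah", "Marco Ruiz"])
```

Labels that never introduce themselves fall through to the voice-profile or manual post-meeting paths.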
Post-Call Summarizer
Full transcript → structured output: TLDR, decisions, action items (with owner + due date), open questions, quoted highlights.
Alternatives: GPT-4o, Gemini 2.5 Pro
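A sketch of the structured-output contract the post-call pass is asked to fill. Field names are illustrative, not a fixed schema; the point is that every action item carries its evidence quote and confidence:

```python
from typing import Optional, TypedDict

class ActionItem(TypedDict):
    owner: str
    task: str
    due: Optional[str]   # ISO date, or None if no date was stated
    quote: str           # verbatim transcript evidence for the commitment
    confidence: float    # extractor confidence, 0.0-1.0

class MeetingSummary(TypedDict):
    tldr: str
    decisions: list[str]
    action_items: list[ActionItem]
    open_questions: list[str]
    highlights: list[str]  # quoted moments worth preserving

# Example of what the model is asked to return (contents invented):
summary: MeetingSummary = {
    "tldr": "Agreed to ship the pricing page Friday.",
    "decisions": ["Move to usage-based pricing."],
    "action_items": [{"owner": "Priya", "task": "Ship pricing page",
                      "due": "2026-04-24", "quote": "I'll ship it by Friday.",
                      "confidence": 0.92}],
    "open_questions": ["Enterprise discount tiers?"],
    "highlights": ["'Pricing is the whole quarter.'"],
}
```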
Action Item Extractor
Second pass specifically on action items — requires explicit commitment language from the transcript, outputs confidence score.
Alternatives: GPT-4o, Fine-tuned Haiku
Transcript + Summary Storage
Stores transcript (searchable), summaries, and embeddings for later semantic search across meetings.
Alternatives: Supabase + pgvector, S3 + Typesense
Summary Delivery
Posts summary to Slack/email, pushes action items to Linear/Asana/Jira, creates follow-up tasks in CRM.
Alternatives: Slack bot, Linear API, Notion database
The stack
Nova-3 leads on streaming latency (sub-300ms), accented-English WER, and built-in diarization quality. AssemblyAI is competitive. Self-hosted Whisper is viable at scale but requires GPU ops.
Alternatives: AssemblyAI Universal-2, Whisper v3 Large, AWS Transcribe
Sonnet 4 produces cleaner structured output and fabricates action items much less often than GPT-4o on meeting transcripts — verified on real-world eval sets.
Alternatives: GPT-4o, Gemini 2.5 Pro
Runs every 60s on rolling window. Speed and cost matter more than peak quality.
Alternatives: GPT-4o-mini, Gemini 2.0 Flash
Recall.ai abstracts Zoom/Meet/Teams/Webex bot joining with one API. Building this yourself takes months of platform-specific edge cases.
Alternatives: Fireflies API, Custom via Zoom SDK
Meeting search is the killer feature at scale ('what did we decide about pricing last month'). Embed each meeting's summary + quotable moments, not every turn.
Alternatives: Supabase, Turbopuffer
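The retrieval step reduces to nearest-neighbor search over one embedding per meeting summary (plus one per quotable moment). A toy sketch with hand-made 2-D vectors — a real system would use an embedding model and the vector store's own index:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, index, k=2):
    """index: list of (meeting_id, vector). Top-k ids by cosine similarity."""
    ranked = sorted(index, key=lambda iv: cosine(query_vec, iv[1]), reverse=True)
    return [mid for mid, _ in ranked[:k]]

# Toy index: m1 and m3 are 'about pricing', m2 is not.
index = [("m1", [1.0, 0.0]), ("m2", [0.0, 1.0]), ("m3", [0.9, 0.1])]
hits = search([1.0, 0.0], index)
```

Embedding at summary granularity keeps the index at one or two vectors per meeting instead of hundreds per-turn, which is what makes 'what did we decide about pricing last month' cheap at scale.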
Action items are useless unless they land where the owner already works. Support 2–3 destinations deeply rather than 10 shallowly.
Alternatives: Asana, Jira, Notion
You must eval against real meeting transcripts with known action items — synthetic data misses the 'X said they'd think about it' ambiguity.
Alternatives: Langfuse, Humanloop
Cost at each scale
Prototype
100 meetings/mo, ~80 hrs audio
$75/mo
Startup
5,000 meetings/mo, ~4,000 hrs audio
$3,900/mo
Scale
100,000 meetings/mo, ~80,000 hrs audio
$62,000/mo
Tradeoffs
Streaming vs batch STT
Streaming enables live captions and live summaries but costs 2–3x batch. For most workflows (post-meeting summary), batch after the call is fine and cheaper. Use streaming only when you're actively surfacing context during the call.
Deepgram vs self-hosted Whisper
At <10k hrs/month, Deepgram is cheaper and better. Above 50k hrs/month, self-hosted Whisper v3 on L40S or H100 GPUs hits lower unit cost but requires ops, GPU capacity planning, and gives up some WER on accents without fine-tuning.
One-shot summary vs extract-then-summarize
Feeding the full transcript and asking for everything at once is simpler but more lossy. Two-pass (first extract decisions + action items as bullets, then summarize) yields measurably better action item recall and fewer hallucinations, at ~1.5x cost.
Failure modes & guardrails
Hallucinated action items no one actually committed to
Mitigation: Require each action item to include a verbatim quote from the transcript as source. Reject extractions where the quote doesn't contain explicit commitment language ('I'll', 'we should', 'let me', 'by Friday'). Run a secondary Sonnet pass as LLM-as-judge.
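The deterministic half of this guardrail — verbatim quote check plus commitment-language check — is cheap to run before the LLM-as-judge pass. Phrase list and item shape are illustrative; extend the phrases per-org:

```python
# Commitment phrases from the mitigation above; tune per-org.
COMMIT_PHRASES = ("i'll", "i will", "we should", "let me", "by friday")

def grounded(item: dict, transcript: str) -> bool:
    """Accept an action item only if its quote appears verbatim in the
    transcript AND the quote itself contains commitment language."""
    quote = item["quote"]
    return quote in transcript and any(p in quote.lower() for p in COMMIT_PHRASES)

transcript = "Priya: I'll send the pricing doc by Friday. Marco: interesting idea."
good = {"owner": "Priya", "quote": "I'll send the pricing doc by Friday"}
bad = {"owner": "Marco", "quote": "Marco will explore the idea"}  # never said
```

Items that fail this filter are dropped before the secondary Sonnet judge ever sees them, so the judge only arbitrates genuinely ambiguous cases.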
Wrong speaker attribution on crosstalk
Mitigation: Use Deepgram diarization with per-speaker channels where the platform provides them (Zoom does, Google Meet partially). For single-channel audio, voice-print match against attendees' past meetings; flag low-confidence attributions in the summary.
Summary misses the one key decision
Mitigation: Add a dedicated 'decisions' extraction pass with a tight rubric (a decision = statement + agreement from >1 speaker). Run against a golden eval set of 200 meetings with human-labeled decisions before any prompt change ships.
Privacy breach — summary delivered to wrong channel
Mitigation: Never auto-deliver across org boundaries — internal summaries never DM external attendees. Per-meeting ACL derived from calendar invite. Default to private (invitee-only) delivery; user explicitly opts in to broader channels.
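A minimal sketch of deriving the default delivery list from the invite; function and parameter names are hypothetical:

```python
def delivery_targets(invitees, org_domain, opt_in_channels=()):
    """Per-meeting ACL from the calendar invite: internal invitees only by
    default; external attendees are never auto-delivered; broader channels
    (e.g. a Slack channel) require an explicit per-meeting opt-in."""
    internal = [e for e in invitees if e.endswith("@" + org_domain)]
    return internal + list(opt_in_channels)

# External attendee is silently excluded from the default delivery list.
targets = delivery_targets(["priya@acme.com", "marco@vendor.io"], "acme.com")
```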
Audio with thick accent or technical jargon drops WER
Mitigation: Enable Deepgram's custom vocabulary with product/person/company names from the meeting invite and past transcripts. Prompt the summarizer with 'likely typos: X→Y' hints derived from STT confidence scores below 0.7.
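The typo-hint half of this mitigation can be built from the STT word-confidence stream plus the known-names vocabulary; this sketch uses stdlib fuzzy matching as a stand-in for whatever matcher you prefer:

```python
import difflib

def low_confidence_hints(words, vocab, threshold=0.7):
    """words: (word, confidence) pairs from the STT word stream.
    vocab: product/person/company names from the invite and past transcripts.
    Emit 'likely typo: X -> Y' hints for the summarizer prompt."""
    hints = []
    for word, conf in words:
        if conf >= threshold:
            continue  # high-confidence words need no hint
        match = difflib.get_close_matches(word, vocab, n=1, cutoff=0.6)
        if match:
            hints.append(f"likely typo: {word} -> {match[0]}")
    return hints

hints = low_confidence_hints([("Cloud", 0.55), ("meeting", 0.95)], ["Claude"])
```

The hints go into the summarizer's system prompt, not into the transcript itself, so the stored transcript stays faithful to what the STT actually heard.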
Frequently asked questions
Which STT service is best for meetings in 2026?
Deepgram Nova-3 and AssemblyAI Universal-2 are the two real choices. Nova-3 edges ahead on streaming latency and multilingual accented English. Whisper v3 Large is competitive and cheaper at scale, but diarization requires a separate model (pyannote).
Why Claude Sonnet 4 over GPT-4o for summaries?
On meeting-specific evals, Sonnet 4 fabricates action items less frequently and produces cleaner structured JSON on long inputs. GPT-4o is close on general summary quality but has higher variance on action item accuracy.
How do you handle 2-hour meetings that exceed context windows?
Two-pass: chunk into 20-min windows, summarize each with Haiku 4, then feed chunk summaries + raw action item extractions into Sonnet 4 for the final structured output. Don't try to fit 2 hours verbatim into one call even with 1M-context models — quality drops on the middle chunks.
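The chunking step above can be sketched as a timestamp bucketing pass (segment shape is illustrative); each bucket then gets its own Haiku summary before the final Sonnet call:

```python
def chunk_by_time(segments, chunk_s=1200):
    """Split (start_seconds, text) segments into ~20-minute windows.
    Each window is summarized by a cheap model before the final pass."""
    chunks = []
    for start, text in segments:
        idx = int(start // chunk_s)
        while len(chunks) <= idx:
            chunks.append([])
        chunks[idx].append(text)
    return chunks

# Four segments across ~42 minutes land in three 20-minute chunks.
segs = [(0, "intro"), (600, "pricing"), (1300, "roadmap"), (2500, "wrap-up")]
chunks = chunk_by_time(segs)
```

Chunking on timestamps rather than token counts keeps each window topically coherent, since meetings tend to switch topics on minute boundaries, not token boundaries.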
Can it replace Fireflies or Otter?
Yes, if you're willing to own the integration layer — the STT-plus-LLM stack described here is essentially what those tools are built on. Fireflies/Otter give you the product polish (mobile apps, search UI, SSO) at a premium. Build your own for internal use or when you need tight integration with your existing tools.
How accurate are the action items?
With the two-pass extract-then-verify pattern, we see ~85–90% precision and ~75–85% recall on real meeting data. The extract-only, no-quote-grounding pattern drops precision to 60–70% — do not skip the grounding requirement.
Does it work for non-English meetings?
Deepgram Nova-3 and AssemblyAI cover 30+ languages with varying quality. For non-English meetings, summarize in the original language then optionally translate — translating the transcript first introduces meaning drift that the summary inherits.
How do you deal with compliance (HIPAA, GDPR)?
Use Deepgram's HIPAA tier and Anthropic zero-retention endpoints. Allow per-org data residency (EU-only, US-only) via Bedrock. Retention policy: delete audio within 7 days, make transcript retention user-configurable, and for HIPAA tenants default to structured summaries only, with no raw transcript excerpts.