Reference Architecture · agent
Meeting Notetaker Agent
Last updated: April 16, 2026
Quick answer
The production stack uses Deepgram Nova-3 or Whisper v3 for transcription with speaker diarization, Claude Sonnet 4 for structured summary + action item extraction, and Haiku 4 for incremental live summarization during the call. Expect $0.10–$0.35 per hour of audio end-to-end, 99.1%+ transcript accuracy (word level) on US English and 96–98% on accented speech. Action items ship with confidence scores and owner attribution.
The problem
Teams waste hours on meeting notes — or worse, ship decisions without notes at all. You need an agent that joins a call (or ingests a recording), produces a high-quality transcript with speaker labels, extracts concrete action items with owners, and writes a tight summary that's useful a week later. Hard parts: accent robustness, speaker diarization, extracting decisions (not just topics), and not hallucinating action items that no one committed to.
Architecture
Audio Capture
Bot joins Zoom/Meet/Teams via native API or browser automation. Records 16kHz mono per-speaker where possible.
Alternatives: Recall.ai (meeting bot), Zoom RTMS, file upload
Speech-to-Text + Diarization
Streaming transcript with speaker labels. Nova-3 gives sub-300ms latency and built-in diarization.
Alternatives: AssemblyAI Universal-2, Whisper v3 + pyannote, AWS Transcribe
Live Summarizer
Runs every ~60s on rolling 5-min transcript window. Produces 'where we are' summary for late joiners and a running agenda.
Alternatives: GPT-4o-mini, Gemini 2.0 Flash
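A minimal sketch of the rolling-window selection that feeds the live summarizer every ~60s — the `Segment` shape and the prompt formatting are illustrative, not a fixed interface; the actual model call is omitted:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # seconds from meeting start
    speaker: str   # diarization label or resolved name
    text: str

def rolling_window(segments, now, window_s=300):
    """Return the transcript slice covering the last window_s seconds."""
    return [s for s in segments if s.start >= now - window_s]

def window_prompt(segments):
    """Format a window as 'Speaker: text' lines for the summarizer model."""
    return "\n".join(f"{s.speaker}: {s.text}" for s in segments)

# At t=400s, only the last two segments fall inside the 5-minute window.
segs = [Segment(30, "S0", "Kicking off."),
        Segment(150, "S1", "Pricing first."),
        Segment(390, "S0", "Agreed, ship Friday.")]
win = rolling_window(segs, now=400)
```

The summarizer prompt would carry the previous running summary plus `window_prompt(win)`, so late joiners get continuity rather than an isolated 5-minute snapshot.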
Speaker Identity Resolver
Maps anonymous speaker labels (Speaker 0, 1, 2) to real names via meeting invite attendees, voice profile, and introduction parsing.
Alternatives: Google Speaker ID, Pyannote embedding, Manual post-meeting
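One of the three signals — introduction parsing — can be sketched in a few lines. The regex, the turn format, and all names here are hypothetical; a production resolver would combine this with voice profiles and invite metadata:

```python
import re

# Matches self-introductions like "I'm Priya" or "this is Marco".
INTRO = re.compile(r"\b(?:I'm|I am|this is)\s+([A-Z][a-z]+)", re.IGNORECASE)

def resolve_speakers(turns, attendees):
    """Map diarization labels to attendee names via self-introductions.

    turns: list of (label, text) pairs; attendees: invite display names."""
    first_names = {a.split()[0].lower(): a for a in attendees}
    mapping = {}
    for label, text in turns:
        if label in mapping:
            continue
        m = INTRO.search(text)
        if m and m.group(1).lower() in first_names:
            mapping[label] = first_names[m.group(1).lower()]
    return mapping

turns = [("Speaker 0", "Hi all, I'm Priya, thanks for joining."),
         ("Speaker 1", "This is Marco from finance.")]
mapping = resolve_speakers(turns, ["Priya Shah", "Marco Ruiz"])
```

Labels that never introduce themselves fall through to the voice-profile or manual post-meeting paths.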
Post-Call Summarizer
Full transcript → structured output: TLDR, decisions, action items (with owner + due date), open questions, quoted highlights.
Alternatives: GPT-4o, Gemini 2.5 Pro
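A sketch of the structured-output contract the post-call pass is asked to fill. Field names are illustrative, not a fixed schema; the point is that every action item carries its evidence quote and confidence:

```python
from typing import Optional, TypedDict

class ActionItem(TypedDict):
    owner: str
    task: str
    due: Optional[str]   # ISO date, or None if no date was stated
    quote: str           # verbatim transcript evidence for the commitment
    confidence: float    # extractor confidence, 0.0-1.0

class MeetingSummary(TypedDict):
    tldr: str
    decisions: list[str]
    action_items: list[ActionItem]
    open_questions: list[str]
    highlights: list[str]  # quoted moments worth preserving

# Example of what the model is asked to return (contents invented):
summary: MeetingSummary = {
    "tldr": "Agreed to ship the pricing page Friday.",
    "decisions": ["Move to usage-based pricing."],
    "action_items": [{"owner": "Priya", "task": "Ship pricing page",
                      "due": "2026-04-24", "quote": "I'll ship it by Friday.",
                      "confidence": 0.92}],
    "open_questions": ["Enterprise discount tiers?"],
    "highlights": ["'Pricing is the whole quarter.'"],
}
```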
Action Item Extractor
Second pass specifically on action items — requires explicit commitment language from the transcript, outputs confidence score.
Alternatives: GPT-4o, Fine-tuned Haiku
Transcript + Summary Storage
Stores transcript (searchable), summaries, and embeddings for later semantic search across meetings.
Alternatives: Supabase + pgvector, S3 + Typesense
Summary Delivery
Posts summary to Slack/email, pushes action items to Linear/Asana/Jira, creates follow-up tasks in CRM.
Alternatives: Slack bot, Linear API, Notion database
The stack
Nova-3 leads on streaming latency (sub-300ms), accented-English WER, and built-in diarization quality. AssemblyAI is competitive. Self-hosted Whisper is viable at scale but requires GPU ops.
Alternatives: AssemblyAI Universal-2, Whisper v3 Large, AWS Transcribe
Sonnet 4 produces cleaner structured output and fabricates action items much less often than GPT-4o on meeting transcripts — verified on real-world eval sets.
Alternatives: GPT-4o, Gemini 2.5 Pro
Runs every 60s on rolling window. Speed and cost matter more than peak quality.
Alternatives: GPT-4o-mini, Gemini 2.0 Flash
Recall.ai abstracts Zoom/Meet/Teams/Webex bot joining with one API. Building this yourself takes months of platform-specific edge cases.
Alternatives: Fireflies API, Custom via Zoom SDK
Meeting search is the killer feature at scale ('what did we decide about pricing last month'). Embed each meeting's summary + quotable moments, not every turn.
Alternatives: Supabase, Turbopuffer
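The retrieval step reduces to nearest-neighbor search over one embedding per meeting summary (plus one per quotable moment). A toy sketch with hand-made 2-D vectors — a real system would use an embedding model and the vector store's own index:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, index, k=2):
    """index: list of (meeting_id, vector). Top-k ids by cosine similarity."""
    ranked = sorted(index, key=lambda iv: cosine(query_vec, iv[1]), reverse=True)
    return [mid for mid, _ in ranked[:k]]

# Toy index: m1 and m3 are 'about pricing', m2 is not.
index = [("m1", [1.0, 0.0]), ("m2", [0.0, 1.0]), ("m3", [0.9, 0.1])]
hits = search([1.0, 0.0], index)
```

Embedding at summary granularity keeps the index at one or two vectors per meeting instead of hundreds per-turn, which is what makes 'what did we decide about pricing last month' cheap at scale.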
Action items are useless unless they land where the owner already works. Support 2–3 destinations deeply rather than 10 shallowly.
Alternatives: Asana, Jira, Notion
You must eval against real meeting transcripts with known action items — synthetic data misses the 'X said they'd think about it' ambiguity.
Alternatives: Langfuse, Humanloop
Cost at each scale
Prototype
100 meetings/mo, ~80 hrs audio
$75/mo
Startup
5,000 meetings/mo, ~4,000 hrs audio
$3,900/mo
Scale
100,000 meetings/mo, ~80,000 hrs audio
$62,000/mo
Tradeoffs
Streaming vs batch STT
Streaming enables live captions and live summaries but costs 2–3x batch. For most workflows (post-meeting summary), batch after the call is fine and cheaper. Use streaming only when you're actively surfacing context during the call.
Deepgram vs self-hosted Whisper
At <10k hrs/month, Deepgram is cheaper and better. Above 50k hrs/month, self-hosted Whisper v3 on L40S or H100 GPUs hits lower unit cost but requires ops, GPU capacity planning, and gives up some WER on accents without fine-tuning.
One-shot summary vs extract-then-summarize
Feeding the full transcript and asking for everything at once is simpler but more lossy. Two-pass (first extract decisions + action items as bullets, then summarize) yields measurably better action item recall and fewer hallucinations, at ~1.5x cost.
Failure modes & guardrails
Hallucinated action items no one actually committed to
Mitigation: Require each action item to include a verbatim quote from the transcript as source. Reject extractions where the quote doesn't contain explicit commitment language ('I'll', 'we should', 'let me', 'by Friday'). Run a secondary Sonnet pass as LLM-as-judge.
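The deterministic half of this guardrail — verbatim quote check plus commitment-language check — is cheap to run before the LLM-as-judge pass. Phrase list and item shape are illustrative; extend the phrases per-org:

```python
# Commitment phrases from the mitigation above; tune per-org.
COMMIT_PHRASES = ("i'll", "i will", "we should", "let me", "by friday")

def grounded(item: dict, transcript: str) -> bool:
    """Accept an action item only if its quote appears verbatim in the
    transcript AND the quote itself contains commitment language."""
    quote = item["quote"]
    return quote in transcript and any(p in quote.lower() for p in COMMIT_PHRASES)

transcript = "Priya: I'll send the pricing doc by Friday. Marco: interesting idea."
good = {"owner": "Priya", "quote": "I'll send the pricing doc by Friday"}
bad = {"owner": "Marco", "quote": "Marco will explore the idea"}  # never said
```

Items that fail this filter are dropped before the secondary Sonnet judge ever sees them, so the judge only arbitrates genuinely ambiguous cases.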
Wrong speaker attribution on crosstalk
Mitigation: Use Deepgram diarization with per-speaker channels where the platform provides them (Zoom does, Google Meet partially). For single-channel audio, voice-print match against attendees' past meetings; flag low-confidence attributions in the summary.
Summary misses the one key decision
Mitigation: Add a dedicated 'decisions' extraction pass with a tight rubric (a decision = statement + agreement from >1 speaker). Run against a golden eval set of 200 meetings with human-labeled decisions before any prompt change ships.
Privacy breach — summary delivered to wrong channel
Mitigation: Never auto-deliver across org boundaries — internal summaries never DM external attendees. Per-meeting ACL derived from calendar invite. Default to private (invitee-only) delivery; user explicitly opts in to broader channels.
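A minimal sketch of deriving the default delivery list from the invite; function and parameter names are hypothetical:

```python
def delivery_targets(invitees, org_domain, opt_in_channels=()):
    """Per-meeting ACL from the calendar invite: internal invitees only by
    default; external attendees are never auto-delivered; broader channels
    (e.g. a Slack channel) require an explicit per-meeting opt-in."""
    internal = [e for e in invitees if e.endswith("@" + org_domain)]
    return internal + list(opt_in_channels)

# External attendee is silently excluded from the default delivery list.
targets = delivery_targets(["priya@acme.com", "marco@vendor.io"], "acme.com")
```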
Audio with thick accent or technical jargon drops WER
Mitigation: Enable Deepgram's custom vocabulary with product/person/company names from the meeting invite and past transcripts. Prompt the summarizer with 'likely typos: X→Y' hints derived from STT confidence scores below 0.7.
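The typo-hint half of this mitigation can be built from the STT word-confidence stream plus the known-names vocabulary; this sketch uses stdlib fuzzy matching as a stand-in for whatever matcher you prefer:

```python
import difflib

def low_confidence_hints(words, vocab, threshold=0.7):
    """words: (word, confidence) pairs from the STT word stream.
    vocab: product/person/company names from the invite and past transcripts.
    Emit 'likely typo: X -> Y' hints for the summarizer prompt."""
    hints = []
    for word, conf in words:
        if conf >= threshold:
            continue  # high-confidence words need no hint
        match = difflib.get_close_matches(word, vocab, n=1, cutoff=0.6)
        if match:
            hints.append(f"likely typo: {word} -> {match[0]}")
    return hints

hints = low_confidence_hints([("Cloud", 0.55), ("meeting", 0.95)], ["Claude"])
```

The hints go into the summarizer's system prompt, not into the transcript itself, so the stored transcript stays faithful to what the STT actually heard.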
Frequently asked questions
Which STT service is best for meetings in 2026?
Deepgram Nova-3 and AssemblyAI Universal-2 are the two real choices. Nova-3 edges ahead on streaming latency and multilingual accented English. Whisper v3 Large is competitive and cheaper at scale, but diarization requires a separate model (pyannote).
Why Claude Sonnet 4 over GPT-4o for summaries?
On meeting-specific evals, Sonnet 4 fabricates action items less frequently and produces cleaner structured JSON on long inputs. GPT-4o is close on general summary quality but has higher variance on action item accuracy.
How do you handle 2-hour meetings that exceed context windows?
Two-pass: chunk into 20-min windows, summarize each with Haiku 4, then feed chunk summaries + raw action item extractions into Sonnet 4 for the final structured output. Don't try to fit 2 hours verbatim into one call even with 1M-context models — quality drops on the middle chunks.
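The chunking step above can be sketched as a timestamp bucketing pass (segment shape is illustrative); each bucket then gets its own Haiku summary before the final Sonnet call:

```python
def chunk_by_time(segments, chunk_s=1200):
    """Split (start_seconds, text) segments into ~20-minute windows.
    Each window is summarized by a cheap model before the final pass."""
    chunks = []
    for start, text in segments:
        idx = int(start // chunk_s)
        while len(chunks) <= idx:
            chunks.append([])
        chunks[idx].append(text)
    return chunks

# Four segments across ~42 minutes land in three 20-minute chunks.
segs = [(0, "intro"), (600, "pricing"), (1300, "roadmap"), (2500, "wrap-up")]
chunks = chunk_by_time(segs)
```

Chunking on timestamps rather than token counts keeps each window topically coherent, since meetings tend to switch topics on minute boundaries, not token boundaries.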
Can it replace Fireflies or Otter?
Yes, if you're willing to own the integration layer — the STT-plus-LLM stack described here is essentially what those tools are built on. Fireflies/Otter give you the product polish (mobile apps, search UI, SSO) at a premium. Build your own for internal use or when you need tight integration with your existing tools.
How accurate are the action items?
With the two-pass extract-then-verify pattern, we see ~85–90% precision and ~75–85% recall on real meeting data. The extract-only, no-quote-grounding pattern drops precision to 60–70% — do not skip the grounding requirement.
Does it work for non-English meetings?
Deepgram Nova-3 and AssemblyAI cover 30+ languages with varying quality. For non-English meetings, summarize in the original language then optionally translate — translating the transcript first introduces meaning drift that the summary inherits.
How do you deal with compliance (HIPAA, GDPR)?
Use Deepgram's HIPAA tier and Anthropic zero-retention endpoints. Allow per-org data residency (EU-only, US-only) via Bedrock. Retention policy: delete audio within 7 days, make transcript retention user-configurable, and for HIPAA tenants default to structured summaries only, with no raw transcript excerpts.