Reference Architecture · multimodal

Video Summarization Pipeline

Last updated: April 16, 2026

Quick answer

The production stack transcribes audio with Whisper large-v3 or Deepgram Nova-3 (with diarization), samples 1-2 keyframes per scene and sends them to Gemini 2.5 Pro vision for visual context, merges transcript and visual descriptions, and uses Claude Sonnet 4 to generate chapters and summaries. For native multimodal processing on shorter videos (under 30 min), Gemini 2.5 Pro can ingest the whole video in one call. Expect $0.40-$2 per hour of video at scale, with roughly an 8-15 minute processing window for a 1-hour video.

The problem

You have long-form video - YouTube uploads, Zoom meetings, webinar recordings, podcasts with video - and need to produce chapter markers, a searchable transcript with speaker labels, key-moment highlights, and a summary. Processing must be cheap enough for thousands of hours per day and accurate enough that the resulting chapters and summaries reflect what actually happened on screen, not just what was said.

Architecture

Video Upload (input) → Audio Extraction + Normalization (infra) → Speech-to-Text + Diarization (llm) → Scene Change Detection (infra) → Frame Vision Description (llm) → Multimodal Fusion (infra) → Chapter + Summary Generator (llm) → Transcript Embedding Index (data) → Player with Chapters (output)

The Transcript Embedding Index also powers in-video search from the player.

Video Upload

Accepts video files via direct upload or URL (YouTube, Vimeo, Zoom cloud recording). Splits the file into separate audio and video streams.

Alternatives: Cloudflare Stream, Mux, AWS MediaConvert, YouTube ingestion

Audio Extraction + Normalization

ffmpeg to pull audio track, normalize volume, and segment into 30s chunks for parallel transcription.

Alternatives: ffmpeg, AWS MediaConvert, Mux audio API
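The extraction step above can be sketched as a single ffmpeg invocation: drop the video stream, loudness-normalize, and emit 30s chunks. A minimal sketch, assuming ffmpeg is on PATH; the loudnorm targets and output pattern are illustrative defaults, not values prescribed by this architecture.

```python
import subprocess

def extract_audio_cmd(video_path: str, out_pattern: str, chunk_seconds: int = 30) -> list[str]:
    """Build ffmpeg args: drop video, normalize loudness, emit 16 kHz mono WAV chunks."""
    return [
        "ffmpeg", "-i", video_path,
        "-vn",                                    # drop the video stream
        "-af", "loudnorm=I=-16:TP=-1.5:LRA=11",   # EBU R128 loudness normalization
        "-ar", "16000", "-ac", "1",               # 16 kHz mono, what Whisper expects
        "-f", "segment", "-segment_time", str(chunk_seconds),
        out_pattern,                              # e.g. "chunks/audio_%04d.wav"
    ]

cmd = extract_audio_cmd("talk.mp4", "chunks/audio_%04d.wav")
# subprocess.run(cmd, check=True)  # uncomment to actually run ffmpeg
```

Segmented output lets each chunk go to a separate transcription worker in parallel.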

Speech-to-Text + Diarization

Transcribes audio with word-level timestamps and speaker labels. Outputs a speaker-diarized transcript.

Alternatives: Deepgram Nova-3, AssemblyAI Universal-2, Google STT v2, Pyannote + Whisper
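When transcription and diarization run as separate models (e.g. Whisper + Pyannote), their outputs must be aligned. A minimal sketch: assign each word a speaker by locating its midpoint inside the diarization turns. The tuple shapes are illustrative; real Whisper and Pyannote outputs need a thin adapter to this form.

```python
def label_words(words, turns):
    """words: [(start, end, text)]; turns: [(start, end, speaker)]; both sorted by time."""
    labeled, i = [], 0
    for start, end, text in words:
        mid = (start + end) / 2
        while i < len(turns) and turns[i][1] < mid:   # skip turns that already ended
            i += 1
        if i < len(turns) and turns[i][0] <= mid:
            speaker = turns[i][2]
        else:
            speaker = "unknown"                       # word falls in a gap between turns
        labeled.append((start, end, text, speaker))
    return labeled

words = [(0.0, 0.4, "hello"), (0.5, 0.9, "there"), (5.0, 5.3, "hi")]
turns = [(0.0, 2.0, "SPEAKER_00"), (4.5, 6.0, "SPEAKER_01")]
labeled = label_words(words, turns)
```

Words outside any turn get "unknown" rather than a guessed speaker, matching the guardrail in the failure-modes section.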

Scene Change Detection

Detects shot changes and samples 1-2 keyframes per scene. Avoids wasteful vision inference on talking-head footage where nothing changes.

Alternatives: PySceneDetect, ffmpeg scene filter, OpenCV
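The core of content-based scene detection (the approach PySceneDetect's ContentDetector takes) is a threshold over frame-to-frame change scores. A sketch under that assumption; the scores here are stand-ins for real HSV-histogram deltas computed from decoded frames, and the defaults are illustrative.

```python
def detect_cuts(frame_scores, threshold=27.0, min_scene_len=15):
    """frame_scores[i] = content change between frame i-1 and i (0-255 scale).

    Returns frame indices where a new scene starts; min_scene_len suppresses
    rapid re-triggers on flashes or camera noise.
    """
    cuts, last_cut = [], 0
    for i, score in enumerate(frame_scores):
        if score > threshold and i - last_cut >= min_scene_len:
            cuts.append(i)
            last_cut = i
    return cuts

scores = [1.0] * 40 + [80.0] + [2.0] * 40   # one hard cut at frame 40
```

Talking-head footage produces near-zero scores for long stretches, which is exactly why this step keeps vision inference cheap.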

Frame Vision Description

For each sampled frame, generates a short description: what's on screen, whiteboard/slide text, key visual events.

Alternatives: GPT-4o vision, Claude Sonnet 4 vision, Gemini 2.0 Flash (low-cost)

Multimodal Fusion

Merges transcript (by timestamp) with visual descriptions. Produces a unified timeline with text, speaker, and scene.

Alternatives: Custom merge script, LlamaIndex multimodal
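The custom merge script amounts to interleaving two timestamped streams. A minimal sketch; the event payload shapes are illustrative, not a fixed schema from this architecture.

```python
def fuse(transcript_segments, frame_captions):
    """Each input: [(timestamp_s, payload_dict)]. Returns a timestamp-ordered timeline."""
    timeline = [(t, "speech", p) for t, p in transcript_segments]
    timeline += [(t, "visual", p) for t, p in frame_captions]
    timeline.sort(key=lambda e: e[0])
    return timeline

speech = [(0.0, {"speaker": "A", "text": "Welcome to the demo."}),
          (12.0, {"speaker": "A", "text": "Here is the dashboard."})]
frames = [(11.5, {"caption": "Screen share: analytics dashboard"})]
timeline = fuse(speech, frames)
```

The fused timeline is what the chapter/summary model consumes, so visual events land next to the words spoken over them.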

Chapter + Summary Generator

Consumes the fused timeline and produces chapter markers, a concise summary, and key-moment highlights with timestamps.

Alternatives: GPT-4o, Gemini 2.5 Pro (long-context native)
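Rendering the fused timeline into a prompt might look like the sketch below. The prompt wording and the JSON shape it requests are illustrative assumptions, not a contract defined by this architecture.

```python
def chapter_prompt(timeline, max_chapters=12):
    """timeline: [(timestamp_s, kind, payload)] as produced by the fusion step."""
    lines = []
    for t, kind, payload in timeline:
        ts = f"{int(t // 60):02d}:{int(t % 60):02d}"   # MM:SS marker per event
        if kind == "speech":
            lines.append(f"[{ts}] {payload['speaker']}: {payload['text']}")
        else:
            lines.append(f"[{ts}] (visual) {payload['caption']}")
    return (
        f"Segment this video into at most {max_chapters} chapters. "
        'Return JSON: [{"start": "MM:SS", "title": "..."}]. '
        "Timeline:\n" + "\n".join(lines)
    )

timeline = [(0.0, "speech", {"speaker": "A", "text": "Welcome."}),
            (11.5, "visual", {"caption": "Title slide"})]
prompt = chapter_prompt(timeline)
```

Keeping explicit timestamps in the prompt is what lets the model emit chapter boundaries that map back to seekable positions.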

Transcript Embedding Index

Embeds transcript chunks and indexes into a vector store so users can search inside the video ('find the part about pricing').

Alternatives: pgvector, Qdrant, Weaviate
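Before embedding, the transcript is grouped into timestamped windows. A sketch of that chunking, assuming the word-tuple shape from the transcription step; the ~30s window is one point in the 20-40s range the FAQ recommends.

```python
def chunk_transcript(words, window_s=30.0):
    """words: [(start, end, text)] sorted. Returns [(window_start_s, chunk_text)]."""
    chunks, buf, win_start = [], [], None
    for start, end, text in words:
        if win_start is None:
            win_start = start
        buf.append(text)
        if end - win_start >= window_s:       # window full: flush it
            chunks.append((win_start, " ".join(buf)))
            buf, win_start = [], None
    if buf:                                   # flush the trailing partial window
        chunks.append((win_start, " ".join(buf)))
    return chunks

words = [(i * 1.0, i * 1.0 + 0.8, f"w{i}") for i in range(70)]
chunks = chunk_transcript(words)
```

Each chunk is embedded and upserted with its start timestamp so a search hit can jump the player straight to that moment.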

Player with Chapters

Video player with chapter markers, transcript search, and click-to-jump on key moments.

Alternatives: Mux Player, Cloudflare Stream Player, Video.js, Custom React

The stack

Audio transcription · Whisper large-v3 (self-hosted on GPU)

Whisper large-v3 is free to run, handles 99 languages, and is competitive with paid APIs on WER. Deepgram Nova-3 has better speaker diarization and lower latency but costs $0.0043/min. Use Whisper for batch, Deepgram for realtime.

Alternatives: Deepgram Nova-3, AssemblyAI Universal-2, Google STT v2

Speaker diarization · Pyannote 3.1

Pyannote 3.1 is the strongest open-source diarizer and pairs cleanly with Whisper. Managed services (Deepgram, AssemblyAI) bundle diarization with transcription and are easier to run end-to-end if you don't want to manage GPU infra.

Alternatives: Deepgram diarization, AssemblyAI speaker labels

Vision captioning · Gemini 2.5 Pro vision

Gemini 2.5 Pro gives the best frame descriptions at $1.25/$5 per MTok and handles slides/whiteboards well. Gemini 2.0 Flash is 10x cheaper and fine for simple scene descriptions. Use Flash for talking-head footage, Pro for screen-share-heavy content.

Alternatives: GPT-4o vision, Claude Sonnet 4 vision, Gemini 2.0 Flash

Native multimodal (short videos) · Gemini 2.5 Pro Video

Gemini 2.5 Pro can ingest up to 1 hour of video natively in one call, mixing visual and audio understanding. For videos under 30 min, this is simpler than a pipeline. Over 1 hour, you still need sampling and fusion.

Alternatives: GPT-4o with frame sampling, Claude Sonnet 4 with frames

Summary + chapter LLM · Claude Sonnet 4

Sonnet 4 follows chapter-structure prompts reliably and produces consistent summary quality across hours-long inputs. Gemini 2.5 Pro shines when you need 1M+ token context for very long transcripts (multi-hour podcasts).

Alternatives: GPT-4o, Gemini 2.5 Pro

Video storage + delivery · Mux or Cloudflare Stream

Mux and Cloudflare Stream handle ingest, transcoding, delivery, and analytics in one service. Roll-your-own with S3+CloudFront is cheaper at huge scale but takes months to get right.

Alternatives: AWS IVS, Vimeo OTT, Self-hosted + Bunny CDN

Cost at each scale

Prototype

100 hours/mo

$95/mo

Whisper self-hosted (small GPU): $20
Gemini 2.5 Pro vision (keyframes): $18
Claude Sonnet 4 summaries: $22
Mux Stream (100 hours): $25
Pinecone free tier: $0
Hosting: $10

Startup

10,000 hours/mo

$7,500/mo

Deepgram Nova-3 transcription: $2,600
Gemini 2.5 Pro vision (scene keyframes): $1,400
Claude Sonnet 4 summaries: $1,800
Pinecone Standard: $300
Mux Stream delivery: $1,200
Observability + hosting: $200

Scale

500,000 hours/mo

$220,000/mo

Self-hosted Whisper cluster: $36,000
Gemini 2.5 Pro + Flash vision: $48,000
Claude Sonnet 4 summaries (cached): $42,000
Pinecone Enterprise: $8,000
Mux / Cloudflare Stream: $65,000
Storage + egress: $14,000
SRE + observability: $7,000

Latency budget

Per 1-hour video, with Whisper, vision, and summary stages running in parallel:

Audio extraction + chunking: 4,000 ms median · 9,000 ms p95
Whisper transcription (batched): 180,000 ms median · 300,000 ms p95
Scene detect + keyframe sampling: 12,000 ms median · 28,000 ms p95
Vision description (per 60 frames, batched): 45,000 ms median · 90,000 ms p95
Chapter + summary (Sonnet 4): 18,000 ms median · 36,000 ms p95
End-to-end per 1h video: 480,000 ms (8 min) median · 900,000 ms (15 min) p95

Tradeoffs

Native multimodal vs pipelined

Gemini 2.5 Pro Video can process up to 1 hour natively in a single API call - simpler code, slightly better multimodal reasoning. Downside: pricing for native video is 3-5x more per minute than a pipelined Whisper + frame-sampling approach, and you hit context limits past 1 hour. Pipeline for scale; native for short-form prototypes.

Frame sampling rate

Sampling every scene change is adequate for most content. Sampling every 5 seconds gives more visual detail but triples vision cost. For talking-head content (podcasts, interviews), 1 frame per 30s is enough. For tutorials with screen-share, sample every 5-10s to catch code/slide changes.

Whisper vs managed STT

Whisper large-v3 self-hosted is free per minute but requires GPU infra. Deepgram Nova-3 at $0.0043/min gives better diarization and zero ops burden. Break-even is around 5k hours/month - above that, self-hosting wins on unit economics if you have ML ops capacity.
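The break-even arithmetic above can be checked directly: at $0.0043/min, 5k hours/month of Deepgram costs about $1,290. The fixed self-hosting figure below ($1,300/month for GPU capacity) is an illustrative assumption chosen to match the document's break-even claim, not a quoted vendor price.

```python
DEEPGRAM_PER_MIN = 0.0043  # Deepgram Nova-3 list price per audio minute

def monthly_costs(hours_per_month, gpu_fixed_monthly=1300.0):
    """Return (managed_api_cost, self_hosted_cost) per month in USD."""
    deepgram = hours_per_month * 60 * DEEPGRAM_PER_MIN
    return deepgram, gpu_fixed_monthly

dg, gpu = monthly_costs(5000)   # 5,000 h/mo: 5000 * 60 * 0.0043 = $1,290
```

Above the crossover, Deepgram cost keeps scaling linearly while the (assumed) GPU spend grows in coarse steps, which is why self-hosting wins at volume if you have the ops capacity.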

Failure modes & guardrails

Speaker diarization swaps two speakers mid-conversation

Mitigation: Enforce a minimum speaker-segment length (3-5 seconds) to reduce false splits. When diarization confidence is low, mark the segment 'speaker unknown' rather than guessing. Post-process with Pyannote re-clustering on the full audio embedding to catch swaps.
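The minimum-segment-length guardrail can be sketched as follows: turns shorter than the floor are absorbed into the previous turn rather than trusted as real speaker changes. The segment shape is illustrative.

```python
def enforce_min_length(turns, min_len=3.0):
    """turns: [(start, end, speaker)] sorted. Merge sub-min_len turns backward."""
    cleaned = []
    for start, end, speaker in turns:
        if cleaned and (end - start) < min_len:
            prev_start, prev_end, prev_speaker = cleaned[-1]
            cleaned[-1] = (prev_start, end, prev_speaker)   # absorb the blip
        else:
            cleaned.append((start, end, speaker))
    return cleaned

turns = [(0.0, 10.0, "A"), (10.0, 11.2, "B"), (11.2, 25.0, "A")]
cleaned = enforce_min_length(turns)
```

A 1.2s "B" blip between two long "A" turns is treated as a false split here; genuine short interjections are the cost of this heuristic, so tune min_len per content type.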

Vision model misreads slide/whiteboard text

Mitigation: Run an OCR pass (Tesseract or Gemini) on detected slide frames and feed the raw OCR text alongside the vision description. Disagreements between OCR text and vision caption indicate a problem - flag for review or request a clearer frame.

Non-English audio transcribed incorrectly or mixed with English hallucinations

Mitigation: Always run language detection on a 60s audio sample before transcription and pass the detected language explicitly to Whisper. Whisper is prone to English hallucinations on quiet or non-speech audio - filter out segments with a low average log-probability (e.g. avg_logprob below roughly -1.0) or a high no_speech_prob.
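A sketch of that hallucination filter: avg_logprob, no_speech_prob, and compression_ratio are fields Whisper emits per segment; the cutoffs below are common heuristics and should be tuned per corpus.

```python
def filter_hallucinations(segments,
                          min_avg_logprob=-1.0,
                          max_no_speech=0.6,
                          max_compression=2.4):
    """Keep only Whisper segments that look like genuine speech."""
    kept = []
    for seg in segments:
        if seg["avg_logprob"] < min_avg_logprob:
            continue                      # model was guessing
        if seg["no_speech_prob"] > max_no_speech:
            continue                      # probably silence or music
        if seg["compression_ratio"] > max_compression:
            continue                      # repetitive looped output
        kept.append(seg)
    return kept

segs = [
    {"text": "real speech", "avg_logprob": -0.3, "no_speech_prob": 0.1, "compression_ratio": 1.4},
    {"text": "thanks for watching", "avg_logprob": -1.6, "no_speech_prob": 0.8, "compression_ratio": 1.2},
]
kept = filter_hallucinations(segs)
```

"Thanks for watching" on near-silent audio is a classic Whisper hallucination pattern; both the low avg_logprob and the high no_speech_prob catch it here.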

Long video exceeds context window for summary generation

Mitigation: Use a map-reduce approach: summarize 10-15 min chunks individually, then produce a final summary from the chunk summaries. Gemini 2.5 Pro's 1M+ token context handles 5-8 hours of transcript directly - use it for podcasts, Sonnet 4 for shorter content with better quality.
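The map step of that approach can be sketched as splitting the fused timeline into ~12-minute chunks; each chunk is summarized independently, then a final reduce call summarizes the summaries. Event shape and chunk length are illustrative.

```python
def split_for_map_reduce(timeline, chunk_s=720.0):
    """timeline: [(timestamp_s, ...)] sorted. Returns a list of timeline chunks."""
    chunks, current, boundary = [], [], chunk_s
    for event in timeline:
        if event[0] >= boundary and current:   # crossed a chunk boundary: flush
            chunks.append(current)
            current, boundary = [], boundary + chunk_s
        current.append(event)
    if current:
        chunks.append(current)
    return chunks

timeline = [(t, f"event-{t}") for t in range(0, 3600, 60)]  # 1h, one event per minute
chunks = split_for_map_reduce(timeline)
```

Chunk boundaries here are time-based for simplicity; snapping them to the nearest scene change or speaker turn avoids splitting a topic mid-sentence.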

Chapters generated do not match actual video structure

Mitigation: Eval chapter quality against human-labeled ground truth on 50-100 sample videos. Require chapters to have a clear topic change at the boundary. If the model splits mid-topic or misses an obvious chapter, adjust the prompt to use the visual scene changes as stronger hints.

Frequently asked questions

How long does it take to process a 1-hour video?

Pipelined with parallel Whisper + vision + summary: 6-10 min median on a well-configured cluster, P95 15 min. Native Gemini 2.5 Pro Video: 2-5 min for sub-1h videos. User expectation: if you process while the video is being watched, show partial transcript in under 2 min to feel realtime.

How much does video summarization cost?

At scale (500k hours/month), $0.40-$0.60 per hour of video using pipelined Whisper + Gemini Flash + Sonnet 4. Using native Gemini 2.5 Pro Video: $1.50-$2.50 per hour. Mux/Cloudflare Stream adds $0.05-$0.15 per hour for delivery. Full loaded: roughly $0.50-$3 per hour of video processed and served.

Can I skip vision processing and just use the transcript?

For podcasts and interviews, yes - visuals add little. For tutorials, slide decks, product demos, screen-share content, and visual-heavy YouTube: no. The transcript alone misses 30-50% of meaning. Always include vision for screen-heavy content.

Which LLM is best for summarization?

Claude Sonnet 4 produces the most consistent, structured summaries across video lengths. GPT-4o is close. Gemini 2.5 Pro wins when you need very long context (5h+ transcripts) in a single call. For chapter generation specifically, Sonnet 4 follows structural prompts most reliably.

How do I handle speaker identification (not just diarization)?

Diarization gives you 'Speaker A' vs 'Speaker B'. For actual names, either (1) accept labels from the user at upload time, (2) cross-reference with Zoom/Teams participant metadata, or (3) use voice biometrics (speaker embeddings matched to an internal directory) - only viable inside one org for compliance and accuracy.

Should I store the full transcript in a vector DB?

Yes, if users will search within videos. Chunk the transcript by 20-40 second windows, embed, and store with timestamps. Users can search 'where did they discuss pricing' and jump directly to the timestamp. Voyage-3 embeddings work well; OpenAI text-embedding-3-small is cheapest if your content is mostly English.

What about realtime captioning for live video?

Different architecture - streaming Deepgram Nova-3 (sub-300ms partial transcripts), no keyframe sampling, no vision pass. Realtime summaries usually accumulate for 5-10 min windows and regenerate. Do not try to reuse the batch pipeline for realtime; the latency budgets are incompatible.
