Reference Architecture · classification

Resume Screening Pipeline

Last updated: April 16, 2026

Quick answer

The compliant pipeline uses Gemini 2.5 Pro for PDF parsing (it handles layout best), Claude Sonnet 4 for structured skill and experience extraction, a deterministic rules-based scorer matching against job criteria, and a separate bias-audit pass that masks names/photos/schools before re-scoring for comparison. Expect $0.08-$0.25 per resume at scale. Never use raw LLM ranking as the final decision - always pair a deterministic scorer with human review.

The problem

You receive hundreds to thousands of resumes per job and need to surface the most-qualified candidates to recruiters without introducing bias or running afoul of NYC AEDT, EU AI Act (high-risk category), or EEOC Title VII. The system must parse diverse resume formats, extract structured skills and experience, score against job criteria, provide explanations, and produce an audit log showing that protected-class attributes never influenced the decision.

Architecture

[Architecture diagram] Resume Intake (input) → Resume Parser (LLM) → PII + Bias Masking (infra) → Skill + Experience Extractor (LLM) → JD Matcher + Scorer (infra) → Bias Audit Pass (infra; re-runs the scorer on the masked resume) → Ranked Candidate List (output) → Recruiter Review (output), with every step written to the Audit Log (data).

Resume Intake

Receives PDF, DOCX, HTML resumes from ATS (Greenhouse, Lever, Workday). Normalizes to UTF-8 text plus original file reference.

Alternatives: Greenhouse webhook, Lever API, Workday integration, Email parse

Resume Parser

Extracts structured fields: work history with dates, education, skills, certifications. Handles multi-column layouts, tables, and sidebar designs.

Alternatives: Claude Sonnet 4 vision, GPT-4o vision, Affinda, Rchilli, AWS Textract + LLM cleanup

PII + Bias Masking

Creates a masked version of the resume: removes name, photo, address, school name, age, pronouns, and any protected-class signal. Keeps skills, experience, and measurable outcomes.

Alternatives: Microsoft Presidio, Custom regex + NER, AWS Comprehend PII, spaCy + custom rules

Skill + Experience Extractor

Extracts a structured skill list with proficiency and years-of-experience per skill. Matches against a controlled taxonomy (ESCO, O*NET, or custom).

Alternatives: GPT-4o, Gemini 2.5 Pro, Custom NER + rules

JD Matcher + Scorer

Deterministic rule-based scorer. For each job-description requirement, scores the candidate on a 0-5 scale with evidence citations. Produces a weighted total.

Alternatives: Custom Python scorer, OPA policy, Drools rules engine

Bias Audit Pass

Re-scores the masked resume and compares to the unmasked score. If scores diverge beyond a threshold, flags the decision for human review. Ensures protected signals did not influence the score.

Alternatives: AIF360, Fairlearn, Custom audit service
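The masked-vs-unmasked comparison described above can be sketched in a few lines. This is a minimal stand-in, not AIF360's API: the function name, the relative-gap metric, and the 10% threshold (matching the guardrail later in this document) are illustrative choices.

```python
def audit_divergence(unmasked_score: float, masked_score: float,
                     threshold: float = 0.10) -> dict:
    """Compare the full-resume score against the masked-resume score.

    A relative gap above `threshold` suggests protected signals leaked
    into scoring, so the decision is flagged for human review instead
    of being ranked automatically.
    """
    if masked_score == 0:
        gap = 1.0 if unmasked_score else 0.0
    else:
        gap = abs(unmasked_score - masked_score) / masked_score
    return {"gap": round(gap, 3), "flag_for_review": gap > threshold}

# Example: unmasked 4.2 vs masked 3.5 is a 20% gap - flagged.
result = audit_divergence(4.2, 3.5)
```

In production this check would run per candidate and also in aggregate per protected class, since individual gaps can be noisy while systemic ones are the real compliance signal.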

Ranked Candidate List

Returns a ranked list with scores, per-requirement evidence, and a confidence band. Surfaces in the recruiter's ATS view.

Alternatives: Greenhouse custom field, Lever tag + score, Custom Retool dashboard

Audit Log

Append-only log: resume hash, parser output, scores, model versions, bias-audit result, final decision, recruiter override. Retain for at least 4 years as a conservative baseline (the EEOC minimum is 1 year; 3 years for federal contractors).

Alternatives: BigQuery, Snowflake, Postgres + S3 raw

Recruiter Review

Human recruiter reviews the top N candidates and records thumbs up/down with reason codes. Every move to interview requires human action - the system never auto-rejects or auto-advances.

Alternatives: Greenhouse review, Lever review, Custom UI

The stack

Resume parsing: Gemini 2.5 Pro (multimodal)

Gemini 2.5 Pro handles multi-column layouts and tables better than GPT-4o in 2026 benchmarks. Managed parsers (Affinda, Rchilli) are $0.01-$0.03 per resume and more accurate on straightforward layouts - use them for high volume and LLMs for edge cases.

Alternatives: Claude Sonnet 4 vision, GPT-4o vision, Affinda, Rchilli

Skill extraction: Claude Sonnet 4 with ESCO taxonomy

Sonnet 4 follows structured extraction prompts reliably. Pinning extractions to a controlled taxonomy (ESCO for EU, O*NET for US) avoids the 'Python' vs 'python3' vs 'Py' explosion that breaks matching.

Alternatives: GPT-4o, Gemini 2.5 Pro, Custom NER with SkillNER
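The taxonomy-pinning step can be as simple as an alias table in front of the matcher. The aliases below are a tiny hypothetical sample; a real deployment would load the ESCO or O*NET vocabulary.

```python
# Hypothetical alias table; production would load ESCO/O*NET labels.
SKILL_ALIASES = {
    "python": "Python", "python3": "Python", "py": "Python",
    "postgres": "PostgreSQL", "postgresql": "PostgreSQL",
    "k8s": "Kubernetes", "kubernetes": "Kubernetes",
}

def normalize_skills(raw_skills: list[str]) -> list[str]:
    """Map free-text skill mentions onto a controlled taxonomy.

    Unknown mentions are dropped rather than guessed, so the downstream
    matcher only ever sees canonical labels - no 'Python' vs 'python3'
    vs 'Py' explosion.
    """
    seen, canonical = set(), []
    for skill in raw_skills:
        label = SKILL_ALIASES.get(skill.strip().lower())
        if label and label not in seen:
            seen.add(label)
            canonical.append(label)
    return canonical
```

Dropping unknowns is a deliberate choice: a skill the taxonomy cannot name is a skill the scorer cannot defend in an audit.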

PII + bias masking: Microsoft Presidio + custom regex

Presidio handles 30+ PII types out of the box and is extensible. Masking needs to go beyond standard PII: remove school names, graduation years, gendered pronouns, and photos. A stock PII service will miss several of these.

Alternatives: AWS Comprehend PII, spaCy + rules
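The custom-rule layer beyond stock PII detection can look like the sketch below. This is a crude stdlib stand-in for illustration only - the patterns are assumptions, and production would combine Presidio recognizers with NER, since regex alone misses many bias signals.

```python
import re

# Illustrative patterns; a stock PII service catches names/addresses,
# but schools, graduation years, and pronouns need custom rules.
SCHOOL = re.compile(
    r"\b(?:University|College|Institute) of [A-Z]\w+"
    r"|\b[A-Z]\w+ (?:University|College|Institute)\b")
GRAD_YEAR = re.compile(r"\b(?:class of |graduated )?(19|20)\d{2}\b", re.IGNORECASE)
PRONOUNS = re.compile(r"\b(he|she|him|her|his|hers)\b", re.IGNORECASE)

def mask_bias_signals(text: str) -> str:
    """First-pass masking beyond standard PII: schools, years, pronouns.

    Note: this crude year mask also hits employment dates; a real
    system scopes year masking to education sections so years-of-
    experience signal survives.
    """
    text = SCHOOL.sub("[SCHOOL]", text)
    text = GRAD_YEAR.sub("[YEAR]", text)
    text = PRONOUNS.sub("[PRONOUN]", text)
    return text
```

Usage: run this over the parser output to produce the masked variant that feeds the first-pass score and the bias-audit rerun.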

Scoring engine: Deterministic Python scorer + OPA rules

Do NOT use an LLM as the final ranker. Regulators (NYC AEDT, EEOC) expect deterministic, auditable, replayable scoring logic. LLMs score resumes inconsistently on repeated runs of the same input, which is a compliance red flag.

Alternatives: Drools, Custom service
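A deterministic scorer of the kind described can be very small. This is a minimal sketch under assumed semantics (linear years-of-experience scaling, 0-5 per requirement); real rubrics and the evidence-citation plumbing live upstream.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Requirement:
    name: str
    weight: float          # relative importance of this requirement
    required_years: float  # years needed for a full 5/5 score

def score_candidate(requirements: list[Requirement],
                    candidate_years: dict[str, float]) -> dict:
    """Deterministic 0-5 score per requirement plus a weighted total.

    The same input always produces the same output, so any decision
    can be replayed byte-for-byte for an auditor - the property LLM
    rankers cannot provide.
    """
    per_req, total, weight_sum = {}, 0.0, 0.0
    for req in requirements:
        years = candidate_years.get(req.name, 0.0)
        score = min(5.0, 5.0 * years / req.required_years) if req.required_years else 5.0
        per_req[req.name] = round(score, 2)
        total += score * req.weight
        weight_sum += req.weight
    return {"per_requirement": per_req,
            "weighted_total": round(total / weight_sum, 2) if weight_sum else 0.0}
```

Encoding the rubric as data (here a `Requirement` list) also makes it easy to version alongside the model versions in the audit log.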

Bias mitigation + audit: AIF360 + internal audit service

AIF360 has the widest coverage of fairness metrics (disparate impact, equal opportunity, demographic parity). You need quarterly bias audits documented per NYC AEDT and you need to demonstrate consistent scoring across protected classes.

Alternatives: Fairlearn, Holistic AI, Custom pipeline

ATS integration: Greenhouse Harvest API

Greenhouse Harvest is the best-documented ATS API and the most common in tech hiring. Inject scores as custom fields; do not replace the ATS's own pipeline stages. Recruiters keep control.

Alternatives: Lever API, Workday, SmartRecruiters

Cost at each scale

Prototype

1,000 resumes/mo

$85/mo

Gemini 2.5 Pro parsing: $22
Claude Sonnet 4 extraction: $35
Managed parser fallback (Affinda): $15
Hosting + Presidio: $8
Audit log storage: $5

Startup

50,000 resumes/mo

$5,600/mo

Gemini 2.5 Pro parsing: $950
Claude Sonnet 4 extraction: $2,400
Affinda bulk parser: $600
Self-hosted Presidio + GPU for NER: $450
Snowflake audit log: $400
AIF360 audit infrastructure: $500
Observability + evals: $300

Scale

1,000,000 resumes/mo

$98,000/mo

Gemini 2.5 Pro parsing (cached): $14,000
Claude Sonnet 4 extraction: $42,000
Bulk managed parser: $8,000
Presidio fleet + NER: $6,500
Snowflake + S3 audit retention: $7,500
Bias audit + AIF360: $6,000
ATS integrations, SRE, compliance: $14,000

Latency budget

Total P50: 14,560ms · Total P95: 24,940ms

PDF parse (Gemini 2.5 Pro vision): 1,800ms p50 · 3,200ms p95
PII + bias mask: 220ms p50 · 450ms p95
Skill + experience extraction (Sonnet 4): 2,400ms p50 · 4,200ms p95
JD scorer (deterministic): 40ms p50 · 90ms p95
Bias audit rerun: 2,600ms p50 · 4,500ms p95
End-to-end batch (per resume): 7,500ms p50 · 12,500ms p95

Tradeoffs

LLM ranker vs deterministic scorer

Using Claude or GPT-4o to directly rank candidates is tempting - they write great rationales. But LLMs are non-deterministic, score inconsistently on reruns, and cannot meet NYC AEDT's requirement of auditable, explainable, reproducible scoring. Use LLMs for extraction; use deterministic logic for scoring. This matters legally in the US and EU (AI Act classifies hiring AI as high-risk).

Managed parsers (Affinda, Rchilli) vs vision LLMs

Managed parsers are $0.01-$0.03 per resume and 95%+ accurate on standard layouts. Gemini 2.5 Pro vision is $0.15-$0.25 per resume but handles creative, multi-column, and image-heavy resumes. Route 80% of volume through a managed parser, 20% (designer/creative resumes) through the LLM.

Strict masking vs useful signal

If you mask too much (every school, every company), you lose signal a recruiter legitimately wants. If you mask too little, you leak bias. Compromise: mask names, photos, addresses, graduation years, and school names only for the first-pass score; reveal them at the recruiter-review stage after scoring is locked. Document and review this tradeoff with legal.

Failure modes & guardrails

Resume parser hallucinates work experience or skills

Mitigation: Require every extracted fact to cite a span from the source text. Reject extractions without a citation span. Run a self-consistency check: re-parse 5-10% of resumes with a different model and flag disagreements above a threshold for human review.
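The citation-span requirement reduces to a substring check at validation time. A minimal sketch, assuming each extracted fact carries a `span` field quoting the resume verbatim (the field names are hypothetical):

```python
def validate_extraction(source_text: str, facts: list[dict]) -> dict:
    """Reject extracted facts whose citation span is absent from the source.

    Anything the model 'remembered' rather than read has no supporting
    span and is routed to the rejected bucket for human review.
    """
    accepted, rejected = [], []
    for fact in facts:
        span = fact.get("span", "")
        (accepted if span and span in source_text else rejected).append(fact)
    return {"accepted": accepted, "rejected": rejected}
```

Exact substring matching is strict on purpose: fuzzy matching reopens the door to paraphrased hallucinations, which is exactly what this gate exists to stop.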

Bias creeps in via proxy features (ZIP code, school, name-based gender/race signal)

Mitigation: Maintain a masked-resume baseline and run the scorer twice - once with full resume, once with masked. If scores diverge by more than 10% systemically for any protected class, freeze the model and investigate. AIF360 can detect disparate impact even when explicit protected attributes are absent.
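The aggregate side of this check is the classic four-fifths rule over per-group pass rates. This stdlib sketch is a simplified stand-in for AIF360's disparate-impact metric; group labels and rates are hypothetical inputs from self-reported, aggregated demographics.

```python
def disparate_impact_ratio(pass_rates: dict[str, float]) -> dict:
    """Four-fifths rule check over advance-to-review rates per group.

    `pass_rates` maps a group label to the share of its candidates
    that passed the first screen. Any group whose rate falls below
    0.8x the highest group's rate is the classic EEOC red flag;
    production audits would use AIF360's fuller metric set.
    """
    top = max(pass_rates.values())
    ratios = {g: round(r / top, 3) for g, r in pass_rates.items()}
    return {"ratios": ratios,
            "violations": [g for g, v in ratios.items() if v < 0.8]}
```

Run it on the unmasked-score pass rates; a violation here plus a clean masked-score run points at a proxy feature leaking through the mask.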

Candidate appeals: 'why was I rejected?'

Mitigation: Log every per-requirement score with the source evidence text. Generate a human-readable rationale on request. Never auto-reject - require recruiter action to reject, and log the reason. NYC AEDT and GDPR Article 22 require a human-in-the-loop and an explanation.

Job description is itself biased (gendered language, unnecessary requirements)

Mitigation: Run the JD through a bias-check model (Textio, Ongig, or a custom classifier) before you use it as scoring criteria. Flag gendered terms, age proxies, and unnecessary years-of-experience requirements. Biased JDs produce biased matches regardless of how fair your screener is.

Model drifts after a promotion or firing decision affects training signal

Mitigation: Do NOT train ranking or scoring on 'which candidates got hired' - hiring decisions inherit past bias. Score against objective JD-match criteria only. Treat human override data as audit material, not training data. Re-audit quarterly.

Frequently asked questions

Is it legal to use an LLM to screen resumes?

In most US states yes, but with conditions: NYC AEDT requires an annual bias audit and candidate notification for any automated employment decision tool. The EU AI Act classifies hiring AI as high-risk - you need a conformity assessment, risk management system, transparency to candidates, and human oversight. California is tightening rules. Always consult employment counsel before deploying.

Which LLM is best for parsing resumes?

Gemini 2.5 Pro leads on complex layouts (multi-column, tables, creative resumes) at $1.25/$5 per MTok. Claude Sonnet 4 is better at structured extraction once the text is out. GPT-4o vision is competitive but weaker on non-English resumes. Managed parsers (Affinda, Rchilli) beat all LLMs on standard layouts for price and accuracy.

How do I prevent bias?

Mask names, photos, schools, graduation years, and addresses before scoring. Score against objective JD criteria with deterministic logic. Run a monthly bias audit comparing masked vs unmasked scores across protected classes. Use AIF360 or Fairlearn. Never train a ranker on 'who got hired' - you'll inherit historical bias.

Should I let the LLM make the final ranking decision?

No. LLMs are non-deterministic and produce different rankings on reruns. Regulators and plaintiffs can request your decision logic - 'the LLM decided' is not a defensible audit trail. Use LLMs to extract structured data; use deterministic scoring and human review for decisions.

What about AI-generated resumes?

Rising problem. Detection is unreliable. Instead, focus on evidence: concrete accomplishments with numbers, named projects, named tools used at specific companies. Penalize generic language in your scoring rubric. Follow up with screening calls or take-home exercises for candidates who score high on paper.

What's the total cost per hire?

At 1M resumes/month yielding ~10k hires, the AI pipeline costs roughly $10 per hire ($98k / 10k). Add recruiter time and you are at $200-$500 per hire fully loaded. Compare to $2-5k per hire without automation, but remember: recruiter judgment on the final slate is not replaced, only accelerated.

How long do I retain audit data?

US: minimum 1 year for EEOC; 3 years for federal contractors; many companies retain 4+. EU: consult your DPO - GDPR requires a legitimate retention purpose. NYC AEDT bias audit artifacts must be publicly posted annually. Retain resume, extracted data, scores, model versions, and human decisions for at least 4 years as a baseline.

Related