Reference Architecture · classification
Resume Screening Pipeline
Last updated: April 16, 2026
Quick answer
The compliant pipeline uses Gemini 2.5 Pro for PDF parsing (it handles layout best), Claude Sonnet 4 for structured skill and experience extraction, a deterministic rules-based scorer matching against job criteria, and a separate bias audit pass that masks names/photos/schools before re-scoring for comparison. Expect $0.08-$0.25 per resume at scale. Never use raw LLM ranking as the final decision - always pair a deterministic scorer with human review.
The problem
You receive hundreds to thousands of resumes per job and need to surface the most-qualified candidates to recruiters without introducing bias or running afoul of NYC AEDT, the EU AI Act (high-risk category), or Title VII as enforced by the EEOC. The system must parse diverse resume formats, extract structured skills and experience, score against job criteria, provide explanations, and produce an audit log showing that protected-class attributes never influenced the decision.
Architecture
Resume Intake
Receives PDF, DOCX, HTML resumes from ATS (Greenhouse, Lever, Workday). Normalizes to UTF-8 text plus original file reference.
Alternatives: Greenhouse webhook, Lever API, Workday integration, Email parse
Resume Parser
Extracts structured fields: work history with dates, education, skills, certifications. Handles multi-column layouts, tables, and sidebar designs.
Alternatives: Claude Sonnet 4 vision, GPT-4o vision, Affinda, Rchilli, AWS Textract + LLM cleanup

PII + Bias Masking
Creates a masked version of the resume: removes name, photo, address, school name, age, pronouns, and any protected-class signal. Keeps skills, experience, and measurable outcomes.
Alternatives: Microsoft Presidio, Custom regex + NER, AWS Comprehend PII, spaCy + custom rules
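A minimal sketch of the "custom regex + rules" alternative, covering the signals a stock PII service tends to miss. A production masker layers NER for names and photo stripping on top; these patterns are illustrative starting points only:

```python
import re

# Illustrative patterns for protected-class signals beyond standard PII.
PRONOUNS = re.compile(r"\b(he|him|his|she|her|hers)\b", re.IGNORECASE)
SCHOOL = re.compile(r"\b[A-Z][a-z]+ (University|College|Institute)\b")
GRAD_YEAR = re.compile(r"\b(19|20)\d{2}\b")  # crude age proxy

def mask(text: str) -> str:
    text = PRONOUNS.sub("[PRONOUN]", text)
    text = SCHOOL.sub("[SCHOOL]", text)
    text = GRAD_YEAR.sub("[YEAR]", text)
    return text

print(mask("She graduated from Stanford University in 2014."))
# [PRONOUN] graduated from [SCHOOL] in [YEAR].
```

Note the year pattern also masks employment dates; in practice you would scope it to the education section or keep durations while dropping absolute years.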
Skill + Experience Extractor
Extracts a structured skill list with proficiency and years-of-experience per skill. Matches against a controlled taxonomy (ESCO, O*NET, or custom).
Alternatives: GPT-4o, Gemini 2.5 Pro, Custom NER + rules
JD Matcher + Scorer
Deterministic rule-based scorer. For each job-description requirement, scores the candidate on a 0-5 scale with evidence citations. Produces a weighted total.
Alternatives: Custom Python scorer, OPA policy, Drools rules engine
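The key property of the scorer is determinism: same input, same number, every run. A minimal sketch of the custom-Python alternative, with illustrative weights and requirement names:

```python
def score_candidate(requirements, evidence):
    """requirements: {name: weight}; evidence: {name: (score_0_to_5, citation)}."""
    breakdown, total, weight_sum = [], 0.0, sum(requirements.values())
    for name, weight in requirements.items():
        # Missing evidence scores 0 - never guess in the candidate's favor or against.
        score, citation = evidence.get(name, (0, "no evidence found"))
        total += weight * score
        breakdown.append({"requirement": name, "score": score, "citation": citation})
    return {"total": round(total / weight_sum, 2), "breakdown": breakdown}

result = score_candidate(
    {"python": 3.0, "sql": 1.0},
    {"python": (4, "5 yrs Python at Acme"), "sql": (2, "SQL coursework")},
)
print(result["total"])  # 3.5
```

Because the logic is a fixed weighted sum, an auditor can replay any historical decision from the logged inputs and get the identical score.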
Bias Audit Pass
Re-scores the masked resume and compares to the unmasked score. If scores diverge beyond a threshold, flags the decision for human review. Ensures protected signals did not influence the score.
Alternatives: AIF360, Fairlearn, Custom audit service
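The comparison gate itself is a few lines. A sketch, with the threshold as an explicit policy knob (the guardrails section uses 10%):

```python
def bias_flag(unmasked: float, masked: float, threshold: float = 0.10) -> bool:
    """True = scores diverge beyond threshold; route the decision to human review."""
    if masked == 0:
        return unmasked != 0
    return abs(unmasked - masked) / masked > threshold

print(bias_flag(4.4, 3.8))  # True: ~16% divergence, needs review
```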
Ranked Candidate List
Returns a ranked list with scores, per-requirement evidence, and a confidence band. Surfaces in the recruiter's ATS view.
Alternatives: Greenhouse custom field, Lever tag + score, Custom Retool dashboard
Audit Log
Append-only log: resume hash, parser output, scores, model versions, bias-audit result, final decision, recruiter override. Retain for at least 4 years as a baseline (EEOC requires a minimum of 1 year; federal contractors 3).
Alternatives: BigQuery, Snowflake, Postgres + S3 raw
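A sketch of one audit record, assuming illustrative field names. The essentials are hashing the raw file and writing one immutable JSON line per decision; JSONL appends load cleanly into BigQuery, Snowflake, or S3:

```python
import hashlib
import json
import time

def audit_record(resume_bytes: bytes, scores: dict, model_versions: dict,
                 bias_flagged: bool, decision: str) -> str:
    """One append-only JSONL line per scoring decision."""
    record = {
        "resume_sha256": hashlib.sha256(resume_bytes).hexdigest(),
        "scores": scores,
        "model_versions": model_versions,  # pin every model, for replayability
        "bias_flagged": bias_flagged,
        "decision": decision,
        "ts": int(time.time()),
    }
    return json.dumps(record, sort_keys=True)

line = audit_record(b"%PDF-1.7 ...", {"total": 3.5},
                    {"parser": "gemini-2.5-pro"}, False, "surfaced_to_recruiter")
print(line[:40])
```

Recording model versions alongside scores is what lets you answer "what logic produced this decision?" years later.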
Recruiter Review
Human recruiter reviews the top N candidates. Records thumbs up/down with reason codes. Every move to interview requires human action - the system never auto-rejects or auto-advances.
Alternatives: Greenhouse review, Lever review, Custom UI
The stack
Gemini 2.5 Pro handles multi-column layouts and tables better than GPT-4o in 2026 benchmarks. Managed parsers (Affinda, Rchilli) are $0.01-$0.03 per resume and more accurate on straightforward layouts - use them for high volume and LLMs for edge cases.
Alternatives: Claude Sonnet 4 vision, GPT-4o vision, Affinda, Rchilli
Sonnet 4 follows structured extraction prompts reliably. Pinning extractions to a controlled taxonomy (ESCO for EU, O*NET for US) avoids the 'Python' vs 'python3' vs 'Py' explosion that breaks matching.
Alternatives: GPT-4o, Gemini 2.5 Pro, Custom NER with SkillNER
Presidio handles 30+ PII types out of the box and is extensible. Masking needs to go beyond standard PII: remove school names, graduation years, gendered pronouns, and photos. A stock PII service will miss several of these.
Alternatives: AWS Comprehend PII, spaCy + rules
Do NOT use an LLM as the final ranker. Regulators (NYC AEDT, EEOC) expect deterministic, auditable, replayable scoring logic. LLMs score resumes inconsistently on repeated runs of the same input, which is a compliance red flag.
Alternatives: Drools, Custom service
AIF360 has the widest coverage of fairness metrics (disparate impact, equal opportunity, demographic parity). NYC AEDT requires a documented annual bias audit; running internal audits quarterly gives you margin and lets you demonstrate consistent scoring across protected classes.
Alternatives: Fairlearn, Holistic AI, Custom pipeline
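The headline metric AIF360 computes, disparate impact ratio, is simple enough to sketch in plain Python: the selection rate of the protected group divided by that of the reference group. EEOC's four-fifths rule treats a ratio below 0.8 as evidence of adverse impact:

```python
def disparate_impact(selected_protected: int, total_protected: int,
                     selected_reference: int, total_reference: int) -> float:
    """Selection-rate ratio; < 0.8 triggers investigation under the four-fifths rule."""
    rate_protected = selected_protected / total_protected
    rate_reference = selected_reference / total_reference
    return rate_protected / rate_reference

ratio = disparate_impact(30, 200, 50, 250)  # 0.15 / 0.20
print(round(ratio, 2))  # 0.75 -> below 0.8, investigate
```

The library's value over this one-liner is the breadth of additional metrics and the mitigation algorithms; the numbers to feed in come from your audit log, not the live scoring path.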
Greenhouse Harvest is the best-documented ATS API and the most common in tech hiring. Inject scores as custom fields; do not replace the ATS's own pipeline stages. Recruiters keep control.
Alternatives: Lever API, Workday, SmartRecruiters
Cost at each scale
Prototype: 1,000 resumes/mo - $85/mo
Startup: 50,000 resumes/mo - $5,600/mo
Scale: 1,000,000 resumes/mo - $98,000/mo
Tradeoffs
LLM ranker vs deterministic scorer
Using Claude or GPT-4o to directly rank candidates is tempting - they write great rationales. But LLMs are non-deterministic, score inconsistently on reruns, and cannot meet NYC AEDT's requirement of auditable, explainable, reproducible scoring. Use LLMs for extraction; use deterministic logic for scoring. This matters legally in the US and EU (AI Act classifies hiring AI as high-risk).
Managed parsers (Affinda, Rchilli) vs vision LLMs
Managed parsers are $0.01-$0.03 per resume and 95%+ accurate on standard layouts. Gemini 2.5 Pro vision is $0.15-$0.25 per resume but handles creative, multi-column, and image-heavy resumes. Route 80% of volume through a managed parser, 20% (designer/creative resumes) through the LLM.
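The router can be a cheap layout-complexity heuristic run before any paid call. A hedged sketch - the signals and the 0.6 cutoff are illustrative and should be tuned against your own parser-failure set:

```python
def route_parser(text: str, page_images: int) -> str:
    """Route a resume to the cheap managed parser or the vision LLM."""
    lines = [l for l in text.splitlines() if l.strip()]
    short_lines = sum(1 for l in lines if len(l.strip()) < 25)
    total_lines = max(1, len(lines))
    # Many short lines suggests columns/sidebars; embedded images suggest
    # a design-heavy resume the managed parser will mangle.
    if page_images > 2 or short_lines / total_lines > 0.6:
        return "vision_llm"       # Gemini 2.5 Pro path
    return "managed_parser"       # Affinda/Rchilli path

print(route_parser("Jane Doe\nSkills\nPython\nSQL\nGo\nAWS\nDocker", 0))
# vision_llm
```

A second pass can also re-route: if the managed parser returns low-confidence fields, send the same file to the LLM path rather than accepting a bad parse.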
Strict masking vs useful signal
If you mask too much (every school, every company), you lose signal a recruiter legitimately wants. If you mask too little, you leak bias. Compromise: mask names, photos, addresses, graduation years, and school names only for the first-pass score; reveal them at the recruiter-review stage after scoring is locked. Document and review this tradeoff with legal.
Failure modes & guardrails
Resume parser hallucinates work experience or skills
Mitigation: Require every extracted fact to cite a span from the source text. Reject extractions without a citation span. Run a self-consistency check: re-parse 5-10% of resumes with a different model and flag disagreements above a threshold for human review.
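The citation gate is a strict substring check against the source text. A minimal sketch, assuming each extracted fact carries an evidence_span field:

```python
def validate_extraction(source_text: str, facts: list[dict]) -> list[dict]:
    """Return the facts whose claimed evidence span is absent from the source."""
    rejected = []
    for fact in facts:
        span = fact.get("evidence_span", "")
        if not span or span not in source_text:
            rejected.append(fact)  # hallucinated or unverifiable -> drop or review
    return rejected

src = "Led a team of 4 engineers at Acme from 2020 to 2023."
facts = [
    {"claim": "managed engineers", "evidence_span": "Led a team of 4 engineers"},
    {"claim": "AWS certified", "evidence_span": "AWS Certified Architect"},
]
print([f["claim"] for f in validate_extraction(src, facts)])  # ['AWS certified']
```

Exact-substring matching is deliberately strict; a production version might normalize whitespace first, but fuzzy matching reopens the door to hallucinated evidence.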
Bias creeps in via proxy features (ZIP code, school, name-based gender/race signal)
Mitigation: Maintain a masked-resume baseline and run the scorer twice - once with full resume, once with masked. If scores diverge by more than 10% systemically for any protected class, freeze the model and investigate. AIF360 can detect disparate impact even when explicit protected attributes are absent.
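The systemic check aggregates per-candidate deltas by group. A sketch - group labels come from voluntary self-identification or audit-time estimation, never from the scoring path itself:

```python
from statistics import mean

def systemic_divergence(rows, threshold: float = 0.10) -> dict:
    """rows: (group, unmasked_score, masked_score) triples.
    Returns groups whose mean relative divergence exceeds the threshold."""
    by_group: dict[str, list[float]] = {}
    for group, unmasked, masked in rows:
        by_group.setdefault(group, []).append((unmasked - masked) / masked)
    return {g: round(mean(deltas), 3) for g, deltas in by_group.items()
            if abs(mean(deltas)) > threshold}

rows = [("A", 4.0, 4.0), ("A", 3.6, 3.5), ("B", 4.4, 3.8), ("B", 4.2, 3.6)]
print(systemic_divergence(rows))  # only group B exceeds the 10% threshold
```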
Candidate appeals: 'why was I rejected?'
Mitigation: Log every per-requirement score with the source evidence text. Generate a human-readable rationale on request. Never auto-reject - require recruiter action to reject, and log the reason. NYC AEDT and GDPR Article 22 require a human-in-the-loop and an explanation.
Job description is itself biased (gendered language, unnecessary requirements)
Mitigation: Run the JD through a bias-check model (Textio, Ongig, or a custom classifier) before you use it as scoring criteria. Flag gendered terms, age proxies, and unnecessary years-of-experience requirements. Biased JDs produce biased matches regardless of how fair your screener is.
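A tiny JD lint sketch. The term lists here are illustrative starting points only; commercial tools like Textio maintain far larger, researched lexicons:

```python
import re

# Illustrative lexicons - real tools use researched, much larger lists.
GENDERED = ["ninja", "rockstar", "dominant", "aggressive"]
AGE_PROXIES = ["digital native", "recent graduate", "young"]

def lint_jd(jd: str) -> list[str]:
    """Flag gendered terms and age proxies in a job description."""
    jd_lower = jd.lower()
    return [term for term in GENDERED + AGE_PROXIES
            if re.search(r"\b" + re.escape(term) + r"\b", jd_lower)]

print(lint_jd("Seeking a rockstar engineer, ideally a recent graduate."))
# ['rockstar', 'recent graduate']
```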
Model drifts after a promotion or firing decision affects training signal
Mitigation: Do NOT train ranking or scoring on 'which candidates got hired' - hiring decisions inherit past bias. Score against objective JD-match criteria only. Treat human override data as audit material, not training data. Re-audit quarterly.
Frequently asked questions
Is it legal to use an LLM to screen resumes?
In most US states yes, but with conditions: NYC AEDT requires an annual bias audit and candidate notification for any automated employment decision tool. The EU AI Act classifies hiring AI as high-risk - you need a conformity assessment, risk management system, transparency to candidates, and human oversight. California is tightening rules. Always consult employment counsel before deploying.
Which LLM is best for parsing resumes?
Gemini 2.5 Pro leads on complex layouts (multi-column, tables, creative resumes) at $1.25/$5 per MTok. Claude Sonnet 4 is better at structured extraction once the text is out. GPT-4o vision is competitive but weaker on non-English resumes. Managed parsers (Affinda, Rchilli) beat all LLMs on standard layouts for price and accuracy.
How do I prevent bias?
Mask names, photos, schools, graduation years, and addresses before scoring. Score against objective JD criteria with deterministic logic. Run regular internal bias audits (quarterly is a common cadence) comparing masked vs unmasked scores across protected classes, on top of the annual audit NYC AEDT requires. Use AIF360 or Fairlearn. Never train a ranker on 'who got hired' - you'll inherit historical bias.
Should I let the LLM make the final ranking decision?
No. LLMs are non-deterministic and produce different rankings on reruns. Regulators and plaintiffs can request your decision logic - 'the LLM decided' is not a defensible audit trail. Use LLMs to extract structured data; use deterministic scoring and human review for decisions.
What about AI-generated resumes?
Rising problem. Detection is unreliable. Instead, focus on evidence: concrete accomplishments with numbers, named projects, named tools used at specific companies. Penalize generic language in your scoring rubric. Follow up with screening calls or take-home exercises for candidates who score high on paper.
What's the total cost per hire?
At 1M resumes/month yielding ~10k hires, AI pipeline cost is roughly $10 per hire ($98k / 10k). Add recruiter time and you are at $200-$500 per hire fully loaded. Compare to $2-5k per hire without automation, but remember: recruiter judgment on the final slate is not replaced, only accelerated.
How long do I retain audit data?
US: minimum 1 year for EEOC; 3 years for federal contractors; many companies retain 4+. EU: consult your DPO - GDPR requires a legitimate retention purpose. NYC AEDT bias audit artifacts must be publicly posted annually. Retain resume, extracted data, scores, model versions, and human decisions for at least 4 years as a baseline.