
AI for Named Entity Recognition at Scale

Extract people, organizations, locations, dates, and custom domain entities from text at scale using fine-tuned models or prompted LLMs. Compare speed-accuracy tradeoffs and multi-language support across approaches.

Updated Apr 16, 2026 · 6 workflows · ~$0.10–$8 per 1,000 requests

Quick answer

For standard entities (PER, ORG, LOC, DATE) at high volume, use a fine-tuned BERT-class model (spaCy en_core_web_trf, GLiNER) — it costs under $0.10 per 1,000 documents and achieves 90–94% F1. For custom domain entities, or when accuracy above 95% is required, use Claude Haiku or GPT-4o-mini with few-shot examples. A hybrid pipeline that runs the fast model first and escalates ambiguous spans to an LLM costs $0.50–$2 per 1,000 documents at 96%+ accuracy.

The problem

Enterprise teams processing news feeds, legal documents, or customer communications at scale face a fundamental tradeoff: traditional NER models (spaCy, Flair) run at 50,000+ tokens per second on CPU but struggle with domain-specific entities (drug names, financial instruments, internal product codes) and informal text. LLMs extract custom entities accurately but run 100-1000x slower and at 50-200x higher cost. A newsroom or compliance team processing 500,000 documents per month must balance accuracy, latency, and a budget that can't absorb $10,000+/month in frontier model costs.

Core workflows

High-Volume Standard NER Pipeline

Run fine-tuned transformer models (spaCy, GLiNER) for standard entity types at 10,000+ docs/minute on GPU. Ideal for news monitoring, compliance screening, and search indexing where cost and latency matter most.

claude-haiku-3-5 · spacy · Architecture →
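A minimal sketch of the batched pipeline, assuming the en_core_web_trf model is installed; the batch size is an illustrative default, not a tuned value. The `dedupe_spans` helper shows the typical post-processing step of dropping duplicate extractions before indexing.

```python
# Sketch of a high-throughput spaCy pipeline. Model name and batch size
# are assumptions; tune batch_size for your GPU memory.
def extract_entities(texts, model="en_core_web_trf", batch_size=64):
    """Stream texts through spaCy's nlp.pipe for batched inference."""
    import spacy  # third-party; imported here so the sketch loads without it
    nlp = spacy.load(model, disable=["parser", "lemmatizer"])
    for doc in nlp.pipe(texts, batch_size=batch_size):
        yield [(e.text, e.label_, e.start_char, e.end_char) for e in doc.ents]

def dedupe_spans(spans):
    """Drop exact duplicate (text, label, start, end) tuples, keeping order."""
    seen, out = set(), []
    for s in spans:
        if s not in seen:
            seen.add(s)
            out.append(s)
    return out
```

Disabling unused components (parser, lemmatizer) is the usual lever for throughput when only the NER head is needed.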

Custom Entity Extraction with LLMs

Use few-shot prompted LLMs to extract domain-specific entities: pharmaceutical compounds, financial instruments, legal citations, product SKUs. Achieves 92-96% F1 on custom types without fine-tuning data.

claude-haiku-3-5 · langchain · Architecture →
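A hedged sketch of the few-shot approach: build a prompt from labeled examples, call the model, and parse a JSON reply. The entity types, example text, and the `claude-3-5-haiku-latest` model id are assumptions; substitute your own domain schema and a current model id.

```python
import json

def build_prompt(entity_types, examples, text):
    """Assemble a few-shot NER prompt. examples: [(text, [entity dicts])]."""
    lines = [
        "Extract entities of these types: " + ", ".join(entity_types) + ".",
        'Return JSON only: {"entities": [{"text": "...", "type": "..."}]}',
    ]
    for ex_text, ex_entities in examples:
        lines.append("Text: " + ex_text)
        lines.append("Output: " + json.dumps({"entities": ex_entities}))
    lines.append("Text: " + text)
    lines.append("Output:")
    return "\n".join(lines)

def parse_entities(response_text):
    """Parse the model's JSON reply; return [] on malformed output."""
    try:
        return json.loads(response_text)["entities"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return []

def extract_with_llm(prompt, model="claude-3-5-haiku-latest"):
    import anthropic  # third-party; needs ANTHROPIC_API_KEY in the environment
    client = anthropic.Anthropic()
    msg = client.messages.create(
        model=model, max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return parse_entities(msg.content[0].text)
```

Treating malformed output as an empty result (rather than raising) keeps a batch job running; log those documents for retry.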

Multi-Language NER

Extract entities across 50+ languages using multilingual models (mBERT, XLM-RoBERTa) or LLMs. Critical for global news monitoring, international compliance, and multilingual customer support pipelines.

gpt-4o-mini · gliner · Architecture →
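A sketch of the multilingual path via Hugging Face transformers. The checkpoint id here is an assumption; substitute any XLM-RoBERTa model fine-tuned for NER. The `simplify` helper flattens the pipeline's aggregated output into tuples suitable for downstream indexing.

```python
def multilingual_ner(texts, model_id="Davlan/xlm-roberta-base-ner-hrl"):
    """Run an XLM-R NER checkpoint over texts in any supported language.
    model_id is an assumed example checkpoint, not a recommendation."""
    from transformers import pipeline  # third-party; downloads the model
    ner = pipeline("token-classification", model=model_id,
                   aggregation_strategy="simple")
    return [simplify(ner(t)) for t in texts]

def simplify(raw):
    """Reduce aggregated pipeline dicts to (text, label, score) tuples."""
    return [(e["word"], e["entity_group"], round(float(e["score"]), 3))
            for e in raw]
```

`aggregation_strategy="simple"` merges subword pieces back into whole entity mentions, which matters for languages where XLM-R's tokenizer splits aggressively.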

Entity Linking and Disambiguation

After extraction, link entity spans to canonical identifiers (Wikidata QIDs, stock tickers, internal IDs). Resolves 'Apple' (company vs fruit), 'Jordan' (person vs country) based on context. Critical for knowledge graph population.

claude-sonnet-4 · spacy · Architecture →
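A toy sketch of context-based disambiguation: score each candidate identifier by overlap between the mention's context words and keywords attached to the candidate. The Wikidata QIDs and keyword sets are illustrative; production linkers use embeddings and KB popularity priors, but the shape is the same.

```python
def link_entity(mention, context, candidates):
    """candidates: [(canonical_id, keyword_set)]. Return best id or None."""
    ctx = set(context.lower().split())
    best_id, best_score = None, 0
    for cid, keywords in candidates:
        score = len(ctx & keywords)  # count shared context keywords
        if score > best_score:
            best_id, best_score = cid, score
    return best_id

# Two candidate meanings of "Apple", keyed by (real) Wikidata QIDs.
apple_candidates = [
    ("Q312", {"iphone", "company", "stock", "shares", "cook"}),  # Apple Inc.
    ("Q89",  {"fruit", "tree", "orchard", "pie", "baked"}),      # apple (fruit)
]
```

Returning `None` when no candidate scores above zero lets the caller fall back to the most popular sense or escalate to an LLM.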

Confidence-Threshold Hybrid Pipeline

Run a fast transformer model first; escalate spans with confidence below 0.80 to an LLM for verification. Achieves 96%+ accuracy at 60-70% lower cost than running all documents through an LLM.

claude-haiku-3-5 · humanloop · Architecture →
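The routing step above reduces to a few lines. This sketch assumes spans arrive as (text, label, confidence) tuples from the fast model; everything below the 0.80 threshold goes into the escalation queue for LLM verification.

```python
def route_spans(spans, threshold=0.80):
    """Split fast-model spans into accepted vs. escalate-to-LLM lists.
    spans: [(text, label, confidence)]."""
    accepted = [s for s in spans if s[2] >= threshold]
    escalate = [s for s in spans if s[2] < threshold]
    return accepted, escalate
```

The cost saving comes from the escalation list typically being a small fraction of all spans; batching escalated spans into one LLM call per document cuts it further.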

Real-Time News Entity Indexing

Stream news articles through an NER pipeline to build a real-time entity index for monitoring, alerting, and analytics. Identify mentions of tracked companies, people, and events within seconds of publication.

gpt-4o-mini · gliner · Architecture →
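A minimal in-memory sketch of the entity index that such a stream feeds. A production system would back this with Elasticsearch or Redis and add alerting hooks, but the core structure — an inverted index from normalized entity to (doc_id, timestamp) pairs — looks like this:

```python
import time
from collections import defaultdict

class EntityIndex:
    """Toy real-time entity index: entity -> [(doc_id, timestamp), ...]."""

    def __init__(self):
        self._index = defaultdict(list)

    def add(self, doc_id, entities, ts=None):
        ts = ts if ts is not None else time.time()
        for ent in entities:
            self._index[ent.lower()].append((doc_id, ts))

    def mentions(self, entity, since=0.0):
        """Doc ids mentioning the entity at or after `since` (epoch seconds)."""
        return [d for d, t in self._index[entity.lower()] if t >= since]
```

The `since` filter is what powers "mentions in the last hour" alerting; lowercasing is a stand-in for the real normalization (aliases, transliteration) an entity-linking stage would provide.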

Top tools

  • spacy
  • gliner
  • huggingface-transformers
  • langchain
  • flair-nlp
  • aws-comprehend

Top models

  • claude-haiku-3-5
  • gpt-4o-mini
  • claude-sonnet-4
  • gemini-2-0-flash

FAQs

When should I use a fine-tuned model vs a prompted LLM for NER?

Use a fine-tuned model (spaCy, Flair, GLiNER) when: you need to process more than 100,000 documents/day; latency matters (real-time streaming, user-facing search); your entity types are standard (PER, ORG, LOC, DATE) and your training data is representative; and your budget is under $200/month for the workload. Use a prompted LLM when: you need custom entity types without labeled training data; your text is informal, noisy, or domain-specific in ways a general model handles poorly; accuracy above 95% is required; or you're doing zero-shot NER on a new domain during prototyping. The hybrid approach — fast model for high-confidence predictions, LLM for ambiguous spans — is optimal for production systems processing 10K-500K documents/day.

What F1 scores can I realistically expect?

On standard English newswire text (CoNLL-2003 benchmark): fine-tuned BERT-large achieves 92-93% F1, spaCy en_core_web_trf achieves 89-91%, and GPT-4o achieves 91-94% in zero-shot. On domain-specific text (medical, legal, financial), general models drop to 75-85% F1, while domain-fine-tuned models (BioBERT for medical, FinBERT for financial) recover to 88-93%. On informal text (social media, customer messages): all models drop 5-10 points. Custom entity types (non-standard categories) see 10-20 point drops on models not fine-tuned for them; LLMs with few-shot examples typically achieve 85-92% F1 on custom types.

How do I handle nested entities and overlapping spans?

Traditional sequence labeling models (BERT with BIO/BIOES tagging) cannot handle overlapping spans — a single token can only belong to one entity. For nested entities (e.g., 'New York City' within 'New York City Mayor's Office'), use: (1) GLiNER, which is specifically designed for nested and overlapping entity extraction; (2) span-based models that classify all possible spans independently; or (3) an LLM with a structured output schema that can return overlapping spans. If using an LLM, explicitly prompt it to identify all applicable entity types for each span and return them as an array, not forcing a single-label decision.
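The limitation is easy to see once spans are represented explicitly rather than as per-token BIO tags. In this sketch, spans are (start, end, label) tuples with exclusive ends; a BIO tagger could emit only one of the two overlapping spans below, while span-based models and GLiNER can return both.

```python
def contains(outer, inner):
    """True if outer fully contains inner (spans are (start, end, label))."""
    return outer[0] <= inner[0] and inner[1] <= outer[1] and outer != inner

def nested_pairs(spans):
    """Return (outer, inner) pairs where one span fully contains another."""
    return [(o, i) for o in spans for i in spans if contains(o, i)]

spans = [
    (0, 29, "ORG"),  # "New York City Mayor's Office"
    (0, 13, "LOC"),  # "New York City" — nested inside the ORG span
]
```

A schema for LLM output should likewise be a flat array of such span objects, so nothing forces the model into a single-label-per-token decision.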

How do I build a labeled training dataset for custom NER efficiently?

Start with LLM-assisted annotation: use Claude or GPT-4o to generate initial annotations for 500-1,000 documents, then have human annotators review and correct — this is 3-5x faster than annotation from scratch. Tools like Label Studio, Prodigy (from spaCy), and Argilla provide annotation interfaces with model-in-the-loop suggestion. Aim for at least 200-500 annotated examples per entity type for fine-tuning. Prioritize difficult, ambiguous examples rather than easy clear-cut cases — hard examples improve model robustness more than additional easy examples. Use inter-annotator agreement (IAA > 0.80 Cohen's Kappa) as a quality gate before using labels for training.

What is GLiNER and when should I use it over spaCy?

GLiNER (Generalist and Lightweight Model for Named Entity Recognition) is an open-source model that performs zero-shot NER by encoding both the text and a list of entity type labels, then finding matching spans — similar in spirit to how CLIP works for images. Unlike spaCy, which requires a model fine-tuned for specific entity types, GLiNER can extract arbitrary entity types at inference time just by providing the label text. Use GLiNER when: you have many entity types (10+); entity types change frequently; you need zero-shot transfer to new domains; or you're prototyping before committing to fine-tuning. Tradeoff: GLiNER is slower than a spaCy pipeline (roughly 500-2000 docs/min vs 10,000+ for spaCy on CPU) and slightly less accurate on well-represented types.
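A hedged usage sketch: GLiNER's inference API takes the label strings directly, so changing entity types requires no retraining. The `urchade/gliner_base` checkpoint id is an assumption; check the GLiNER repository for current model names.

```python
def gliner_extract(text, labels, threshold=0.5):
    """Zero-shot NER with GLiNER; labels are plain strings chosen at call time."""
    from gliner import GLiNER  # third-party; downloads the checkpoint
    model = GLiNER.from_pretrained("urchade/gliner_base")  # assumed checkpoint id
    return model.predict_entities(text, labels, threshold=threshold)

def above_threshold(entities, threshold):
    """Filter GLiNER result dicts ({'text', 'label', 'score', ...}) by score."""
    return [e for e in entities if e["score"] >= threshold]
```

For throughput, load the model once and reuse it across calls rather than reloading per document as this minimal sketch does.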

How do I handle NER in low-resource languages?

For languages with limited NER training data (most non-English languages outside French, German, Spanish, Chinese): (1) Use multilingual models: XLM-RoBERTa-large fine-tuned for NER covers 100+ languages and achieves competitive F1 on many. (2) Use LLMs: GPT-4o and Claude Sonnet 4 perform well in 30-40 languages including less-common ones, though accuracy drops for very low-resource languages (Swahili, Amharic, etc.). (3) Cross-lingual transfer: fine-tune on English data, then evaluate on your target language — often achieves 80-85% of the performance of a target-language fine-tuned model at zero cost. (4) Translate first: for batch processing, translate to English, extract entities, then map back — viable when translation quality is high and entity names are preserved.

Related architectures