
AI for Named Entity Recognition at Scale

Extract people, organizations, locations, dates, and custom domain entities from text at scale using fine-tuned models or prompted LLMs. Compare speed-accuracy tradeoffs and multi-language support across approaches.

Updated Apr 16, 2026 · 6 workflows · ~$0.10–$8 per 1,000 requests

Quick answer

For standard entities (PER, ORG, LOC, DATE) at high volume, use a fine-tuned BERT-class model (spaCy en_core_web_trf, GLiNER) — it costs under $0.10 per 1,000 documents and achieves 90–94% F1. For custom domain entities, or when accuracy above 95% is required, use Claude Haiku or GPT-4o-mini with few-shot examples. A hybrid pipeline that runs the fast model first and escalates ambiguous spans to an LLM costs $0.50–$2 per 1,000 documents at 96%+ accuracy.

The problem

Enterprise teams processing news feeds, legal documents, or customer communications at scale face a fundamental tradeoff: traditional NER models (spaCy, Flair) run at 50,000+ tokens per second on CPU but struggle with domain-specific entities (drug names, financial instruments, internal product codes) and informal text. LLMs extract custom entities accurately but run 100-1000x slower and at 50-200x higher cost. A newsroom or compliance team processing 500,000 documents per month must balance accuracy, latency, and a budget that can't absorb $10,000+/month in frontier model costs.

Core workflows

High-Volume Standard NER Pipeline

Run fine-tuned transformer models (spaCy, GLiNER) for standard entity types at 10,000+ docs/minute on GPU. Ideal for news monitoring, compliance screening, and search indexing where cost and latency matter most.

claude-haiku-3-5 · spacy · Architecture →
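A minimal sketch of the batched pipeline, assuming the en_core_web_trf model is installed; the batch size is an illustrative default, not a tuned value. The `dedupe_spans` helper shows the typical post-processing step of dropping duplicate extractions before indexing.

```python
# Sketch of a high-throughput spaCy pipeline. Model name and batch size
# are assumptions; tune batch_size for your GPU memory.
def extract_entities(texts, model="en_core_web_trf", batch_size=64):
    """Stream texts through spaCy's nlp.pipe for batched inference."""
    import spacy  # third-party; imported here so the sketch loads without it
    nlp = spacy.load(model, disable=["parser", "lemmatizer"])
    for doc in nlp.pipe(texts, batch_size=batch_size):
        yield [(e.text, e.label_, e.start_char, e.end_char) for e in doc.ents]

def dedupe_spans(spans):
    """Drop exact duplicate (text, label, start, end) tuples, keeping order."""
    seen, out = set(), []
    for s in spans:
        if s not in seen:
            seen.add(s)
            out.append(s)
    return out
```

Disabling unused components (parser, lemmatizer) is the usual lever for throughput when only the NER head is needed.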

Custom Entity Extraction with LLMs

Use few-shot prompted LLMs to extract domain-specific entities: pharmaceutical compounds, financial instruments, legal citations, product SKUs. Achieves 92-96% F1 on custom types without fine-tuning data.

claude-haiku-3-5 · langchain · Architecture →
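A hedged sketch of the few-shot approach: build a prompt from labeled examples, call the model, and parse a JSON reply. The entity types, example text, and the `claude-3-5-haiku-latest` model id are assumptions; substitute your own domain schema and a current model id.

```python
import json

def build_prompt(entity_types, examples, text):
    """Assemble a few-shot NER prompt. examples: [(text, [entity dicts])]."""
    lines = [
        "Extract entities of these types: " + ", ".join(entity_types) + ".",
        'Return JSON only: {"entities": [{"text": "...", "type": "..."}]}',
    ]
    for ex_text, ex_entities in examples:
        lines.append("Text: " + ex_text)
        lines.append("Output: " + json.dumps({"entities": ex_entities}))
    lines.append("Text: " + text)
    lines.append("Output:")
    return "\n".join(lines)

def parse_entities(response_text):
    """Parse the model's JSON reply; return [] on malformed output."""
    try:
        return json.loads(response_text)["entities"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return []

def extract_with_llm(prompt, model="claude-3-5-haiku-latest"):
    import anthropic  # third-party; needs ANTHROPIC_API_KEY in the environment
    client = anthropic.Anthropic()
    msg = client.messages.create(
        model=model, max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return parse_entities(msg.content[0].text)
```

Treating malformed output as an empty result (rather than raising) keeps a batch job running; log those documents for retry.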

Multi-Language NER

Extract entities across 50+ languages using multilingual models (mBERT, XLM-RoBERTa) or LLMs. Critical for global news monitoring, international compliance, and multilingual customer support pipelines.

gpt-4o-mini · gliner · Architecture →
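A sketch of the multilingual path via Hugging Face transformers. The checkpoint id here is an assumption; substitute any XLM-RoBERTa model fine-tuned for NER. The `simplify` helper flattens the pipeline's aggregated output into tuples suitable for downstream indexing.

```python
def multilingual_ner(texts, model_id="Davlan/xlm-roberta-base-ner-hrl"):
    """Run an XLM-R NER checkpoint over texts in any supported language.
    model_id is an assumed example checkpoint, not a recommendation."""
    from transformers import pipeline  # third-party; downloads the model
    ner = pipeline("token-classification", model=model_id,
                   aggregation_strategy="simple")
    return [simplify(ner(t)) for t in texts]

def simplify(raw):
    """Reduce aggregated pipeline dicts to (text, label, score) tuples."""
    return [(e["word"], e["entity_group"], round(float(e["score"]), 3))
            for e in raw]
```

`aggregation_strategy="simple"` merges subword pieces back into whole entity mentions, which matters for languages where XLM-R's tokenizer splits aggressively.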

Entity Linking and Disambiguation

After extraction, link entity spans to canonical identifiers (Wikidata QIDs, stock tickers, internal IDs). Resolves 'Apple' (company vs fruit), 'Jordan' (person vs country) based on context. Critical for knowledge graph population.

claude-sonnet-4 · spacy · Architecture →
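A toy sketch of context-based disambiguation: score each candidate identifier by overlap between the mention's context words and keywords attached to the candidate. The Wikidata QIDs and keyword sets are illustrative; production linkers use embeddings and KB popularity priors, but the shape is the same.

```python
def link_entity(mention, context, candidates):
    """candidates: [(canonical_id, keyword_set)]. Return best id or None."""
    ctx = set(context.lower().split())
    best_id, best_score = None, 0
    for cid, keywords in candidates:
        score = len(ctx & keywords)  # count shared context keywords
        if score > best_score:
            best_id, best_score = cid, score
    return best_id

# Two candidate meanings of "Apple", keyed by (real) Wikidata QIDs.
apple_candidates = [
    ("Q312", {"iphone", "company", "stock", "shares", "cook"}),  # Apple Inc.
    ("Q89",  {"fruit", "tree", "orchard", "pie", "baked"}),      # apple (fruit)
]
```

Returning `None` when no candidate scores above zero lets the caller fall back to the most popular sense or escalate to an LLM.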

Confidence-Threshold Hybrid Pipeline

Run a fast transformer model first; escalate spans with confidence below 0.80 to an LLM for verification. Achieves 96%+ accuracy at 60-70% lower cost than running all documents through an LLM.

claude-haiku-3-5 · humanloop · Architecture →
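The routing step above reduces to a few lines. This sketch assumes spans arrive as (text, label, confidence) tuples from the fast model; everything below the 0.80 threshold goes into the escalation queue for LLM verification.

```python
def route_spans(spans, threshold=0.80):
    """Split fast-model spans into accepted vs. escalate-to-LLM lists.
    spans: [(text, label, confidence)]."""
    accepted = [s for s in spans if s[2] >= threshold]
    escalate = [s for s in spans if s[2] < threshold]
    return accepted, escalate
```

The cost saving comes from the escalation list typically being a small fraction of all spans; batching escalated spans into one LLM call per document cuts it further.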

Real-Time News Entity Indexing

Stream news articles through an NER pipeline to build a real-time entity index for monitoring, alerting, and analytics. Identify mentions of tracked companies, people, and events within seconds of publication.

gpt-4o-mini · gliner · Architecture →
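A minimal in-memory sketch of the entity index that such a stream feeds. A production system would back this with Elasticsearch or Redis and add alerting hooks, but the core structure — an inverted index from normalized entity to (doc_id, timestamp) pairs — looks like this:

```python
import time
from collections import defaultdict

class EntityIndex:
    """Toy real-time entity index: entity -> [(doc_id, timestamp), ...]."""

    def __init__(self):
        self._index = defaultdict(list)

    def add(self, doc_id, entities, ts=None):
        ts = ts if ts is not None else time.time()
        for ent in entities:
            self._index[ent.lower()].append((doc_id, ts))

    def mentions(self, entity, since=0.0):
        """Doc ids mentioning the entity at or after `since` (epoch seconds)."""
        return [d for d, t in self._index[entity.lower()] if t >= since]
```

The `since` filter is what powers "mentions in the last hour" alerting; lowercasing is a stand-in for the real normalization (aliases, transliteration) an entity-linking stage would provide.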

Top tools

  • spacy
  • gliner
  • huggingface-transformers
  • langchain
  • flair-nlp
  • aws-comprehend

Top models

  • claude-haiku-3-5
  • gpt-4o-mini
  • claude-sonnet-4
  • gemini-2-0-flash

FAQs

When should I use a fine-tuned model vs a prompted LLM for NER?

Use a fine-tuned model (spaCy, Flair, GLiNER) when: you need to process more than 100,000 documents/day; latency matters (real-time streaming, user-facing search); your entity types are standard (PER, ORG, LOC, DATE) and your training data is representative; and your budget is under $200/month for the workload. Use a prompted LLM when: you need custom entity types without labeled training data; your text is informal, noisy, or domain-specific in ways a general model handles poorly; accuracy above 95% is required; or you're doing zero-shot NER on a new domain during prototyping. The hybrid approach — fast model for high-confidence predictions, LLM for ambiguous spans — is optimal for production systems processing 10K-500K documents/day.

What F1 scores can I realistically expect?

On standard English newswire text (CoNLL-2003 benchmark): fine-tuned BERT-large achieves 92-93% F1, spaCy en_core_web_trf achieves 89-91%, and GPT-4o achieves 91-94% in zero-shot. On domain-specific text (medical, legal, financial), general models drop to 75-85% F1, while domain-fine-tuned models (BioBERT for medical, FinBERT for financial) recover to 88-93%. On informal text (social media, customer messages): all models drop 5-10 points. Custom entity types (non-standard categories) see 10-20 point drops on models not fine-tuned for them; LLMs with few-shot examples typically achieve 85-92% F1 on custom types.

How do I handle nested entities and overlapping spans?

Traditional sequence labeling models (BERT with BIO/BIOES tagging) cannot handle overlapping spans — a single token can only belong to one entity. For nested entities (e.g., 'New York City' within 'New York City Mayor's Office'), use: (1) GLiNER, which is specifically designed for nested and overlapping entity extraction; (2) span-based models that classify all possible spans independently; or (3) an LLM with a structured output schema that can return overlapping spans. If using an LLM, explicitly prompt it to identify all applicable entity types for each span and return them as an array, not forcing a single-label decision.
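The limitation is easy to see once spans are represented explicitly rather than as per-token BIO tags. In this sketch, spans are (start, end, label) tuples with exclusive ends; a BIO tagger could emit only one of the two overlapping spans below, while span-based models and GLiNER can return both.

```python
def contains(outer, inner):
    """True if outer fully contains inner (spans are (start, end, label))."""
    return outer[0] <= inner[0] and inner[1] <= outer[1] and outer != inner

def nested_pairs(spans):
    """Return (outer, inner) pairs where one span fully contains another."""
    return [(o, i) for o in spans for i in spans if contains(o, i)]

spans = [
    (0, 29, "ORG"),  # "New York City Mayor's Office"
    (0, 13, "LOC"),  # "New York City" — nested inside the ORG span
]
```

A schema for LLM output should likewise be a flat array of such span objects, so nothing forces the model into a single-label-per-token decision.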

How do I build a labeled training dataset for custom NER efficiently?

Start with LLM-assisted annotation: use Claude or GPT-4o to generate initial annotations for 500-1,000 documents, then have human annotators review and correct — this is 3-5x faster than annotation from scratch. Tools like Label Studio, Prodigy (from spaCy), and Argilla provide annotation interfaces with model-in-the-loop suggestion. Aim for at least 200-500 annotated examples per entity type for fine-tuning. Prioritize difficult, ambiguous examples rather than easy clear-cut cases — hard examples improve model robustness more than additional easy examples. Use inter-annotator agreement (IAA > 0.80 Cohen's Kappa) as a quality gate before using labels for training.

What is GLiNER and when should I use it over spaCy?

GLiNER (Generalist and Lightweight Model for Named Entity Recognition) is an open-source model that performs zero-shot NER by encoding both the text and a list of entity type labels, then finding matching spans — similar in spirit to how CLIP works for images. Unlike spaCy, which requires a model fine-tuned for specific entity types, GLiNER can extract arbitrary entity types at inference time just by providing the label text. Use GLiNER when: you have many entity types (10+); entity types change frequently; you need zero-shot transfer to new domains; or you're prototyping before committing to fine-tuning. Tradeoff: GLiNER is slower than a spaCy pipeline (roughly 500-2000 docs/min vs 10,000+ for spaCy on CPU) and slightly less accurate on well-represented types.
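A hedged usage sketch: GLiNER's inference API takes the label strings directly, so changing entity types requires no retraining. The `urchade/gliner_base` checkpoint id is an assumption; check the GLiNER repository for current model names.

```python
def gliner_extract(text, labels, threshold=0.5):
    """Zero-shot NER with GLiNER; labels are plain strings chosen at call time."""
    from gliner import GLiNER  # third-party; downloads the checkpoint
    model = GLiNER.from_pretrained("urchade/gliner_base")  # assumed checkpoint id
    return model.predict_entities(text, labels, threshold=threshold)

def above_threshold(entities, threshold):
    """Filter GLiNER result dicts ({'text', 'label', 'score', ...}) by score."""
    return [e for e in entities if e["score"] >= threshold]
```

For throughput, load the model once and reuse it across calls rather than reloading per document as this minimal sketch does.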

How do I handle NER in low-resource languages?

For languages with limited NER training data (most non-English languages outside French, German, Spanish, Chinese): (1) Use multilingual models: XLM-RoBERTa-large fine-tuned for NER covers 100+ languages and achieves competitive F1 on many. (2) Use LLMs: GPT-4o and Claude Sonnet 4 perform well in 30-40 languages including less-common ones, though accuracy drops for very low-resource languages (Swahili, Amharic, etc.). (3) Cross-lingual transfer: fine-tune on English data, then evaluate on your target language — often achieves 80-85% of the performance of a target-language fine-tuned model at zero cost. (4) Translate first: for batch processing, translate to English, extract entities, then map back — viable when translation quality is high and entity names are preserved.

Related architectures