PII Handling in LLM Applications (2026)
PII handling in LLM pipelines requires: (1) detecting PII in inputs before sending to the model, (2) redacting or pseudonymizing PII before LLM calls when possible, (3) not logging raw user data with PII in production systems, and (4) using data processing agreements with LLM providers. The safest architecture: pseudonymize PII before the LLM call, process, then re-substitute in the output.
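The pseudonymize → call → re-substitute flow can be sketched end to end with nothing but the standard library. This is a minimal illustration, not a production pipeline: `call_llm` is a stand-in for your provider SDK, and detection is reduced to a single email regex — a real system would use a dedicated detector such as Presidio.

```python
import re

EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.]+')

def call_llm(prompt: str) -> str:
    # Stand-in for a real provider call; echoes the email it saw for illustration.
    # (Assumes the prompt contains at least one email address.)
    return f"Reply to {EMAIL_RE.search(prompt).group(0)} with the account summary."

def process_with_pseudonyms(text: str) -> str:
    # 1. Pseudonymize: swap each email for a realistic fake, remember the mapping.
    mapping = {}
    def swap(match):
        pseudo = f"user{len(mapping) + 1}@example.com"
        mapping[pseudo] = match.group(0)
        return pseudo
    safe_text = EMAIL_RE.sub(swap, text)

    # 2. The model only ever sees pseudonymized text.
    reply = call_llm(safe_text)

    # 3. Re-substitute the real values into the output.
    for pseudo, original in mapping.items():
        reply = reply.replace(pseudo, original)
    return reply
```

The mapping lives only in your process; the real email never reaches the provider.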
When to Use
- Processing user-submitted documents, emails, or support tickets that may contain personal data
- Healthcare applications where PHI (protected health information) is present in text
- Financial applications processing customer data regulated by GDPR, CCPA, or PCI DSS
- Customer support automation where conversation history contains names, emails, and account numbers
- Any multi-tenant application where one user's PII must not appear in another user's LLM context
How It Works
1. Detection: use a PII detector to identify PII in input text. Options: Microsoft Presidio (open-source, Python), AWS Comprehend, Google Cloud DLP, or an LLM-based extractor. Presidio identifies 50+ entity types (names, SSNs, credit cards, phone numbers, emails) with configurable confidence thresholds.
2. Redaction: replace detected PII with [REDACTED] or [PERSON_NAME] before sending to the LLM. Simple and privacy-preserving, but loses context that might be needed for accurate processing.
3. Pseudonymization: replace PII with consistent fake values before the LLM call and reverse after. 'John Smith' → 'Alex Taylor', 'john.smith@corp.com' → 'alex.taylor@example.com'. The LLM sees realistic text; PII never leaves your infrastructure.
4. Tokenization: replace PII with tokens (UUID or sequential ID) and store the mapping securely. 'John Smith' → '[ENTITY_001]'. Reverse-tokenize in the output. More precise than pseudonymization but produces less natural text for the LLM to process.
5. Data minimization: the best PII handling is not collecting it at all. Review what data actually needs to be in the LLM prompt. Account numbers, SSNs, and full addresses are rarely needed — strip them before the prompt even if they're not 'detected' PII.
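Of the steps above, tokenization (step 4) is easy to sketch with the standard library alone. Detection here is mocked with two regexes (SSNs, emails) standing in for a real detector such as Presidio, and the token-to-value mapping is an in-memory dict; in production it would live in an encrypted store.

```python
import re

# Toy detection patterns — a stand-in for a real PII detector.
PII_PATTERNS = {
    'SSN': re.compile(r'\b\d{3}-\d{2}-\d{4}\b'),
    'EMAIL': re.compile(r'[\w.+-]+@[\w-]+\.[\w.]+'),
}

def tokenize(text: str) -> tuple[str, dict]:
    """Replace detected PII with [TYPE_NNN] tokens; return tokenized text and the mapping."""
    mapping = {}
    for entity_type, pattern in PII_PATTERNS.items():
        def swap(match):
            token = f'[{entity_type}_{len(mapping) + 1:03d}]'
            mapping[token] = match.group(0)
            return token
        text = pattern.sub(swap, text)
    return text, mapping

def detokenize(text: str, mapping: dict) -> str:
    """Restore original values in the LLM output."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text
```

Send the tokenized text to the model, then run `detokenize` over the response before showing it to the user.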
Examples
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def redact_pii(text: str, language: str = 'en') -> dict:
    # Detect PII
    results = analyzer.analyze(text=text, language=language)
    # Replace each entity with a type-preserving placeholder
    anonymized = anonymizer.anonymize(
        text=text,
        analyzer_results=results,
        operators={
            'PERSON': OperatorConfig('replace', {'new_value': '<PERSON>'}),
            'EMAIL_ADDRESS': OperatorConfig('replace', {'new_value': '<EMAIL>'}),
            'PHONE_NUMBER': OperatorConfig('replace', {'new_value': '<PHONE>'}),
            'CREDIT_CARD': OperatorConfig('replace', {'new_value': '<CREDIT_CARD>'}),
            'US_SSN': OperatorConfig('replace', {'new_value': '<SSN>'}),
        },
    )
    return {
        'redacted_text': anonymized.text,
        'detected_entities': [r.entity_type for r in results],
        'original_length': len(text),
        'pii_found': len(results) > 0,
    }

# Usage before the LLM call
text = "Hi, I'm John Smith, SSN 123-45-6789, email john@example.com"
result = redact_pii(text)
# Send result['redacted_text'] to the LLM, not the original text
import hashlib

class PseudonymizationManager:
    """Swaps PII spans for deterministic pseudonyms and restores them in outputs."""

    def __init__(self, secret_key: str):
        self.key = secret_key

    def pseudonymize(self, text: str, pii_spans: list[tuple[str, int, int]]) -> tuple[str, dict]:
        result = list(text)
        mappings = {}  # pseudonym → original PII
        # Replace right-to-left so earlier span offsets stay valid
        for entity_type, start, end in sorted(pii_spans, key=lambda s: s[1], reverse=True):
            original = text[start:end]
            pseudo = self._generate_pseudonym(original, entity_type)
            mappings[pseudo] = original
            result[start:end] = list(pseudo)
        return ''.join(result), mappings

    def _generate_pseudonym(self, value: str, entity_type: str) -> str:
        # Deterministic pseudonym — same value always maps to the same pseudonym
        h = hashlib.sha256(f'{self.key}:{entity_type}:{value}'.encode()).hexdigest()[:8]
        return f'[{entity_type}_{h}]'

    def reverse(self, text: str, mappings: dict) -> str:
        for pseudo, original in mappings.items():
            text = text.replace(pseudo, original)
        return text

Common Mistakes
- Logging raw prompts containing PII — many observability tools log the full LLM request/response by default. This sends PII to third-party logging services. Always filter PII from logs before sending them to external platforms, or use a PII-safe logging proxy.
- Trusting LLM-based PII detection without validation — LLMs can miss PII (especially in non-standard formats) or hallucinate PII that isn't there. Use dedicated PII detection tools (Presidio) for regulated pipelines, not general-purpose LLMs.
- Processing HIPAA-covered data without a BAA — if your application processes health information (PHI), you need a Business Associate Agreement (BAA) with your LLM provider before you can legally process that data. Anthropic, OpenAI, and AWS all offer BAAs.
- Redaction that breaks context for the LLM — replacing 'Dr. Sarah Johnson' with '[REDACTED]' removes the context that this is a doctor. Use entity-type placeholders ('<DOCTOR_NAME>') that preserve semantic context while removing the actual PII.
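The logging pitfall above can be mitigated in-process. Here is a sketch of a stdlib `logging.Filter` that scrubs two common PII shapes (emails, US SSNs) from every record before any handler or exporter sees it — illustrative patterns only; a real deployment would back this with a full detector rather than two regexes.

```python
import logging
import re

# Illustrative patterns; swap in a real detector for regulated pipelines.
_SCRUB = [
    (re.compile(r'[\w.+-]+@[\w-]+\.[\w.]+'), '<EMAIL>'),
    (re.compile(r'\b\d{3}-\d{2}-\d{4}\b'), '<SSN>'),
]

def scrub(text: str) -> str:
    for pattern, placeholder in _SCRUB:
        text = pattern.sub(placeholder, text)
    return text

class PIIScrubFilter(logging.Filter):
    """Scrub PII from log messages before they reach any handler."""
    def filter(self, record: logging.LogRecord) -> bool:
        # getMessage() folds %-style args into the message first.
        record.msg = scrub(record.getMessage())
        record.args = ()  # args are already folded into msg above
        return True
```

Attach it with `logger.addFilter(PIIScrubFilter())` before wiring up any observability exporter, so third-party platforms only ever receive scrubbed records.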
FAQ
Do LLM providers store my inputs?
Anthropic does not train on API data by default and stores prompts for a limited period for abuse monitoring. OpenAI does not train on API data by default but stores for 30 days unless zero data retention is enabled (Enterprise). Google Cloud (Vertex AI) does not train on data by default and stores per your data retention settings. Always review the provider's data processing agreement before sending sensitive data.
Can I use frontier LLMs for HIPAA-regulated workflows?
Yes, with the right agreements. Anthropic, OpenAI, and AWS Bedrock all offer HIPAA Business Associate Agreements (BAAs). You must sign the BAA before processing PHI. Additionally, you must implement technical safeguards: encryption, access controls, audit logging, and the minimum necessary standard (only include PHI required for the task).
What counts as PII for GDPR?
Any information that can identify a natural person directly or indirectly: name, email, phone, IP address, location data, cookies, health data, financial information, social media profiles. Pseudonymized data is still PII under GDPR if re-identification is possible. Truly anonymized data (statistically de-identified) is not. When in doubt, treat data as PII.
How do I handle PII in RAG document corpora?
Scan documents for PII at ingest time. For documents containing PII: (1) Assess whether PII is load-bearing for the use case. (2) If not, redact before embedding and indexing. (3) If load-bearing (e.g., HR system), implement access controls so PII-containing chunks are only retrieved for authorized users. Never index raw PII-containing documents in a shared vector store without access controls.
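The ingest-time decision above can be expressed as a small gate. This sketch assumes a `detect_pii` stand-in (a real pipeline would call Presidio's `AnalyzerEngine` here), chunks as plain dicts, and a hypothetical `acl` field that your retriever is assumed to enforce at query time.

```python
import re

def detect_pii(text: str) -> list[str]:
    # Stand-in for a real detector (e.g. Presidio's AnalyzerEngine).
    found = []
    if re.search(r'\b\d{3}-\d{2}-\d{4}\b', text):
        found.append('US_SSN')
    if re.search(r'[\w.+-]+@[\w-]+\.[\w.]+', text):
        found.append('EMAIL_ADDRESS')
    return found

def prepare_chunk_for_index(text: str, pii_is_load_bearing: bool, owner_group: str) -> dict:
    """Gate a chunk at ingest: redact PII, or tag the chunk with access controls."""
    entities = detect_pii(text)
    if not entities:
        return {'text': text, 'acl': 'public'}
    if pii_is_load_bearing:
        # PII is needed (e.g. HR system): index as-is but restrict retrieval.
        return {'text': text, 'acl': owner_group}
    # PII is not needed: redact before embedding and indexing.
    for pattern, placeholder in [
        (r'\b\d{3}-\d{2}-\d{4}\b', '<SSN>'),
        (r'[\w.+-]+@[\w-]+\.[\w.]+', '<EMAIL>'),
    ]:
        text = re.sub(pattern, placeholder, text)
    return {'text': text, 'acl': 'public'}
```

Run this once per chunk at ingest; never let a chunk reach the shared vector store without passing through the gate.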
What PII detection libraries work best?
Microsoft Presidio (open-source, Python) is the most comprehensive: 50+ entity types, multiple languages, custom recognizers, and active maintenance. For cloud-native deployments: AWS Comprehend PII detection and Google Cloud DLP are scalable managed services. For real-time/embedded use: spaCy NER with custom models is the fastest local option. Presidio is the standard choice for most applications.