
LLM Guardrails: Input and Output Safety Layers (2026)

Quick Answer

Guardrails are policy enforcement layers around your LLM. Input guardrails check user messages for: off-topic requests, prohibited content, and prompt injection. Output guardrails check LLM responses for: harmful content, hallucinations, and policy violations. Guardrails can be rules-based (fast, deterministic), LLM-based (flexible, more accurate), or specialized classifiers (Llama Guard, Perspective API). Use layered guardrails — no single check catches everything.

When to Use

  • Customer-facing LLM applications where users could send harmful, abusive, or off-topic requests
  • Applications in regulated industries (healthcare, finance, legal) with content policy requirements
  • Chatbots and assistants that must stay within a defined topical scope
  • Multi-user platforms where LLM outputs are shown to other users (social, marketplace)
  • Any application where reputational, legal, or safety risk from LLM outputs is significant

How It Works

  1. Input guardrails run before the LLM call: topic filter (is the query within scope?), safety classifier (is the query harmful?), PII detection (does the query contain sensitive data?), injection detector (does the query contain injection patterns?).
  2. Output guardrails run after the LLM response: content safety check (does the output contain harmful content?), factual grounding check (are claims supported by context?), policy compliance check (does the output follow brand/legal guidelines?), format validation.
  3. Layered architecture: fast rules-based checks run first (milliseconds), then cheap LLM classifiers (~100ms), then expensive full-LLM validation (~500ms). Early fast layers catch obvious violations; expensive layers handle subtle ones.
  4. Guardrails as a service: NeMo Guardrails (NVIDIA), Guardrails AI (Python library), Llama Guard (Meta, open source), Amazon Bedrock Guardrails, Azure AI Content Safety. Each offers different controls and integrates differently.
  5. Graceful handling: when a guardrail triggers, return a helpful redirect rather than a bare error: 'I'm designed to help with [scope]. For [detected topic], I'd recommend [alternative resource].' Users should understand why the request was declined.
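The layered architecture in step 3 can be sketched as a short-circuiting chain of checks. The layer bodies below are illustrative stubs (the regex, layer names, and thresholds are assumptions, not from any particular library); in production the classifier and LLM layers would call real models.

```python
import re
from typing import Callable

Check = Callable[[str], tuple[bool, str]]  # each layer returns (allowed, reason)

def rules_layer(text: str) -> tuple[bool, str]:
    # ~microseconds: regex screen for obvious violations
    if re.search(r"\b(bomb|explosive|weapon)\b", text.lower()):
        return False, "prohibited content"
    return True, ""

def classifier_layer(text: str) -> tuple[bool, str]:
    # ~100ms in practice: a cheap safety classifier would be called here
    return True, ""  # stubbed so the sketch runs standalone

def llm_layer(text: str) -> tuple[bool, str]:
    # ~500ms in practice: full LLM validation for subtle cases (stubbed)
    return True, ""

LAYERS: list[Check] = [rules_layer, classifier_layer, llm_layer]

def run_guardrails(text: str) -> tuple[bool, str]:
    for layer in LAYERS:
        allowed, reason = layer(text)
        if not allowed:
            return False, reason  # short-circuit: never pay for costlier layers
    return True, ""
```

Because the chain short-circuits, the expensive layers only ever see traffic that the cheap ones could not decide.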

Examples

NeMo Guardrails basic setup
# Install: pip install nemoguardrails
# colang config file: config/rails.co

'''
define user express greeting
  "hello"
  "hi"

define bot express greeting
  "Hello! I'm the LLMversus assistant. I help with AI pricing and comparison questions."

define flow greeting
  user express greeting
  bot express greeting

define user ask off topic
  "Can you write code for me?"
  "Tell me a joke"

define bot refuse off topic
  "I'm focused on AI tool comparisons and pricing. I can't help with that, but I can help you compare LLM pricing!"

define flow off topic
  user ask off topic
  bot refuse off topic
'''

# Input/output rails are registered in config/config.yml (not the Colang file), e.g.:
# rails:
#   input:
#     flows:
#       - llama guard check input   # safety classifier (requires a configured Llama Guard model)
#   output:
#     flows:
#       - self check facts          # hallucination check

# Python integration
from nemoguardrails import RailsConfig, LLMRails

config = RailsConfig.from_path('./config')
rails = LLMRails(config)

response = rails.generate(messages=[{'role': 'user', 'content': user_input}])
# (or `await rails.generate_async(...)` from inside an async context)
Output: NeMo Guardrails uses Colang (a custom DSL) to define conversation flows, with input and output rails registered in the YAML config. The Llama Guard input rail automatically screens for harmful content using Meta's Llama Guard model.
Custom lightweight guardrail
import re
from anthropic import Anthropic

client = Anthropic()

SCOPE_VIOLATION_PATTERNS = [
    r'write.*code', r'generate.*image', r'translate.*to',
    r'recipe|cooking|food', r'medical.*advice|diagnos'
]

PROHIBITED_CONTENT = [
    r'bomb|weapon|explosive', r'illegal.*drug', r'hack.*into'
]

def input_guardrail(user_message: str) -> tuple[bool, str]:
    # Check prohibited content first (fast)
    for pattern in PROHIBITED_CONTENT:
        if re.search(pattern, user_message.lower()):
            return False, 'I cannot help with that request.'
    
    # Check scope
    for pattern in SCOPE_VIOLATION_PATTERNS:
        if re.search(pattern, user_message.lower()):
            return False, 'I\'m focused on AI pricing comparisons. For that request, please try a general assistant.'
    
    return True, ''

def guarded_chat(user_message: str) -> str:
    allowed, reason = input_guardrail(user_message)
    if not allowed:
        return reason
    
    response = client.messages.create(
        model='claude-3-5-haiku-20241022', max_tokens=1024,
        messages=[{'role': 'user', 'content': user_message}]
    )
    return response.content[0].text
Output: A lightweight regex-based input guardrail: fast (<1ms), free, and able to catch obvious violations. Pair it with an LLM-based guardrail for subtle cases. Its false positive rate is higher than LLM-based checks, so test with real queries before deploying.
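The example above covers inputs only. A matching output guardrail can use a cheap LLM as a judge on the draft reply; in this sketch the judge call is injected as a callable so it stays provider-agnostic and testable (the prompt wording and fail-closed parsing are assumptions).

```python
from typing import Callable

# Judge prompt wording is an assumption; tune it for your policy.
JUDGE_PROMPT = (
    "You review draft replies from an AI-pricing assistant. "
    "Answer with exactly SAFE or UNSAFE.\n\nDraft reply:\n{reply}"
)

def parse_verdict(judge_text: str) -> bool:
    # True = allowed. Anything that isn't clearly SAFE fails closed.
    return judge_text.strip().upper().startswith("SAFE")

def output_guardrail(draft_reply: str, judge: Callable[[str], str]) -> bool:
    # `judge` sends a prompt to a cheap model (e.g. Haiku) and returns its text
    return parse_verdict(judge(JUDGE_PROMPT.format(reply=draft_reply)))
```

Wired up with the Anthropic client from the example above, `judge` would be something like `lambda p: client.messages.create(model='claude-3-5-haiku-20241022', max_tokens=5, messages=[{'role': 'user', 'content': p}]).content[0].text`.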

Common Mistakes

  • Over-blocking with overly broad guardrails — a guardrail that blocks 5% of legitimate queries is worse than missing some edge cases. Measure false positive rate; target under 1% false positives for production applications.
  • Guardrails as a substitute for model safety training — guardrails add latency and cost, and can be worked around by sophisticated users. They're a complement to model safety, not a replacement. Use the most safety-trained model available as your base.
  • No monitoring of guardrail triggers — guardrail trigger events are valuable data. Log what content triggered each guardrail, which users triggered them most, and what fraction of total traffic is blocked. This tells you how much adversarial usage you're experiencing.
  • Not testing guardrails adversarially — users will find ways around guardrails. Hire red teamers or use automated adversarial testing (Garak) to probe your guardrails before launch. What you test during development is usually less creative than what production users attempt.

FAQ

What's the difference between Guardrails AI and NeMo Guardrails?

Guardrails AI (Python) focuses on output validation — defining validators for schema, format, content, and factual grounding of LLM outputs. It's best for structured output validation. NeMo Guardrails (NVIDIA) focuses on conversational safety — defining conversation rails, topical scope, and safety checks for chat applications. It's best for chatbots and assistants. Many production systems use both.

What is Llama Guard and should I use it?

Llama Guard (Meta, open-source) is a safety classifier fine-tuned to detect unsafe content in LLM inputs and outputs. It classifies requests/responses against a configurable set of harm categories (violence, self-harm, illegal activity, etc.). It's more accurate than generic LLMs for safety classification and can run locally. In 2026, Llama Guard 3 supports multi-turn conversations and custom safety categories.
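Llama Guard replies with "safe", or "unsafe" followed by the violated category codes (e.g. "unsafe" then "S1"). A small parser for that convention might look like this; the category names follow the MLCommons hazard taxonomy and the list is abbreviated here.

```python
# Abbreviated mapping from Llama Guard S-codes to category names
# (MLCommons hazard taxonomy; remaining codes omitted for brevity).
CATEGORY_NAMES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S9": "Indiscriminate Weapons",
}

def parse_llama_guard(raw: str) -> tuple[bool, list[str]]:
    # Returns (allowed, violated_categories); unrecognized output fails closed.
    lines = raw.strip().splitlines()
    if lines and lines[0].strip().lower() == "safe":
        return True, []
    codes = lines[1].split(",") if len(lines) > 1 else []
    return False, [CATEGORY_NAMES.get(c.strip(), c.strip()) for c in codes]
```

Feeding the raw classifier text through a parser like this lets you log which harm categories are actually being triggered in production.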

How do I handle jailbreak attempts?

Jailbreaks attempt to override safety training through roleplay scenarios, hypotheticals, or social engineering. Defense: (1) Use safety-trained models (Claude, GPT-4o). (2) Add a jailbreak detector that flags prompts containing known jailbreak patterns (Llama Guard detects many). (3) Rate-limit users who trigger repeated safety violations. (4) Log and review jailbreak attempts to improve your defenses. Accept that sophisticated jailbreaks against safety-trained models are the model provider's responsibility.

What's the latency overhead of guardrails?

Rules-based guardrails: <1ms. Lightweight LLM classifiers (Haiku, GPT-4o-mini): 50-150ms. Full safety classifiers (Llama Guard): 50-200ms on GPU. Run input and output guardrails in parallel where possible. For time-sensitive applications, run input guardrails before the main LLM call and output guardrails in parallel with streaming delivery.
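Running independent checks concurrently, as suggested above, can be sketched with asyncio; the check bodies are stubs standing in for real classifier calls.

```python
import asyncio

async def safety_check(text: str) -> bool:
    await asyncio.sleep(0.05)  # stands in for a ~50ms classifier call
    return "attack" not in text.lower()

async def grounding_check(text: str) -> bool:
    await asyncio.sleep(0.05)  # stands in for a grounding/hallucination check
    return True

async def run_output_rails(text: str) -> bool:
    # gather() runs both checks concurrently, so total guardrail latency
    # is the slowest check, not the sum of all checks
    results = await asyncio.gather(safety_check(text), grounding_check(text))
    return all(results)
```

Usage: `asyncio.run(run_output_rails(draft_reply))` returns True only if every check passes.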

Do I need guardrails if I use Claude or GPT-4o?

Yes. Safety-trained frontier models have excellent built-in safety for common harms but don't enforce your application-specific policies (topical scope, brand voice, business rules). Your guardrail layer adds: application-specific scope enforcement, PII handling, domain-specific content policies, and defense-in-depth against model failures. Think of model safety and application guardrails as complementary layers.
