LLM Skills Library
50 AI engineering techniques — real prompts, real code, real tradeoffs.
Prompting
Core techniques for getting better outputs from any LLM.
Chain-of-Thought
intermediateChain-of-thought prompting guides LLMs to reason step-by-step before answering. It dramatically improves accuracy on math, logic, and multi-step reasoning tasks. Learn zero-shot CoT, few-shot CoT, and when to use each.
Context Stuffing
intermediateContext stuffing fills the LLM context window with relevant documents, code, or data so the model can reason over it directly — without retrieval. Learn when to stuff context vs. build RAG, and how to structure large contexts for maximum accuracy.
Few-Shot Prompting
beginnerFew-shot prompting dramatically improves LLM consistency by showing 2–8 examples of the desired input-output pattern before the actual query. Learn example selection, ordering, and formatting strategies.
Meta-Prompting
advancedMeta-prompting uses an LLM to generate, critique, and refine prompts for a target LLM. It automates prompt engineering by having the model act as a prompt optimizer — dramatically reducing manual iteration time.
Prompt Chaining
intermediatePrompt chaining breaks complex tasks into sequential LLM calls where each output feeds the next. Learn when to chain, how to design handoffs, and how to handle errors mid-chain.
Prompt Compression
advancedPrompt compression reduces input token count by 50–90% through selective information removal, LLMLingua-style token pruning, and semantic summarization. Learn when compression helps, when it hurts, and how to measure the tradeoff.
ReAct Pattern
advancedReAct (Reasoning + Acting) interleaves LLM reasoning traces with tool actions, enabling agents to decompose tasks, call external APIs, and update their plan based on observations. It's the foundation of most production LLM agents.
Role Prompting
beginnerRole prompting assigns a persona or expert identity to an LLM to improve output quality and domain alignment. Learn which roles work, why they help, and the limits of persona assignment.
Self-Consistency
advancedSelf-consistency runs the same chain-of-thought prompt multiple times with temperature > 0 and takes a majority vote on the final answers. It reliably improves accuracy on reasoning and math tasks at the cost of multiple inference calls.
Structured Output
intermediateGetting reliable JSON, CSV, or schema-compliant output from LLMs. Learn constrained decoding, schema prompting, validation loops, and which APIs guarantee valid JSON.
System Prompt Design
intermediateSystem prompts are the foundation of every production LLM application. Learn how to write system prompts that consistently control persona, format, safety constraints, and output quality.
Temperature & Sampling
intermediateTemperature, top-p, top-k, and frequency penalties control how an LLM samples its output. Learn exactly what each parameter does, when to turn temperature to zero, and how to tune sampling for creative vs. deterministic tasks.
Tree of Thought
advancedTree of Thought (ToT) enables LLMs to explore multiple reasoning branches in parallel, evaluate intermediate steps, and backtrack — mimicking deliberate human problem-solving for hard tasks.
XML Tags for Claude
intermediateClaude is trained to follow XML-tagged prompt structure exceptionally well. Learn how to use XML tags to separate instructions from content, pass multi-part inputs, and improve Claude's output consistency and accuracy.
Zero-Shot Prompting
beginnerZero-shot prompting lets you use LLMs without providing any examples. Learn when it works, when it fails, and how to write zero-shot prompts that get reliable results.
RAG & Retrieval
Build accurate retrieval systems that don't hallucinate.
Chunking Strategies
intermediateChunking splits documents into pieces for embedding and retrieval. The right chunking strategy — fixed-size, semantic, hierarchical, or late chunking — directly determines RAG accuracy. Learn the tradeoffs for each approach.
Contextual Retrieval
advancedContextual retrieval (Anthropic, 2024) prepends a short context summary to each chunk before embedding, giving the embedding model information about where the chunk sits in the document. It reduces retrieval failures by 49% on Anthropic's benchmarks.
Embedding Selection
intermediateChoosing the right embedding model is the single biggest lever for RAG retrieval quality. This guide covers the major embedding models in 2026, benchmarks, dimension/cost tradeoffs, and how to evaluate embeddings on your specific domain.
GraphRAG
advancedGraphRAG builds a knowledge graph from your corpus and uses it to answer complex, multi-hop questions that naive vector RAG fails on. Microsoft's GraphRAG system (2024) showed 2-5x better performance on global/analytical queries.
Hybrid Search
advancedHybrid search combines dense vector search (semantic similarity) with sparse keyword search (BM25) to retrieve documents. It consistently outperforms either approach alone, especially for queries with specific terms, product names, or technical jargon.
Late Chunking
advancedLate chunking (Jina AI, 2024) embeds the full document first to capture global context, then pools token embeddings into chunk representations. This preserves cross-sentence context in each chunk's embedding, improving retrieval for context-dependent text.
Metadata Filtering
intermediateMetadata filtering narrows the vector search space by pre-filtering documents on structured attributes — date, category, author, language — before semantic search. It dramatically improves precision and enables multi-tenant RAG.
Query Expansion
advancedQuery expansion uses an LLM to rewrite, decompose, or augment user queries before retrieval, improving recall by generating hypothetical documents, sub-queries, or synonym variations. It solves the vocabulary mismatch problem in RAG.
RAG Evaluation
intermediateRAG evaluation measures both retrieval quality (did we fetch the right chunks?) and generation quality (did the LLM produce an accurate, grounded answer?). Learn the RAGAS framework, key metrics, and how to build a continuous eval pipeline.
Reranking
intermediateReranking is a second-stage retrieval step that scores each retrieved chunk for relevance to the query using a cross-encoder model. It consistently improves RAG answer quality by 15–30% over pure vector search with minimal added latency.
Agents & Tools
Tool use, memory, planning, and multi-agent coordination.
Agent Evaluation
advancedEvaluating LLM agents is harder than evaluating single-turn LLMs because agents take sequences of actions, have long-horizon goals, and can fail in many ways. Learn task completion metrics, trajectory evaluation, and how to build regression tests for agents.
Agent Memory
intermediateLLM agents need memory to maintain context across conversations and sessions. Learn the four memory types (in-context, external, procedural, episodic), when to use each, and how to build persistent memory systems that don't hallucinate past events.
Agent Planning
advancedAgent planning is how LLM agents decompose complex tasks into executable steps, manage dependencies between steps, and adapt the plan when execution diverges from expectations. Good planning architecture is the difference between agents that complete 10-step tasks and ones that fail after 3.
Error Recovery
intermediateProduction LLM agents fail in predictable ways: tool errors, invalid JSON, hallucinated arguments, and infinite loops. Learn defensive error handling patterns that let agents recover gracefully rather than crashing or producing wrong outputs.
Human-in-the-Loop
intermediateHuman-in-the-loop (HITL) patterns define when LLM agents pause for human confirmation, verification, or input. Proper HITL design prevents costly agent mistakes while avoiding excessive interruptions that destroy productivity.
Multi-Agent Coordination
advancedMulti-agent systems use an orchestrator LLM to decompose tasks and delegate to specialized subagents. This enables parallelism, specialization, and fault isolation that single-agent architectures can't achieve. Learn the orchestrator/subagent pattern, handoff protocols, and when to use agents vs. tools.
Parallel Tool Calls
advancedParallel tool calling lets LLMs request multiple tool executions simultaneously in a single response, rather than sequentially. This reduces multi-step agent latency by 50–80% when tasks can run concurrently.
Prompt Caching
intermediatePrompt caching (Anthropic, OpenAI) stores computed key-value pairs for long prompt prefixes and reuses them across requests. It reduces input token costs by 90% and latency by 85% on cache hits — essential for any agent with a large system prompt or repeated context.
Streaming
intermediateStreaming sends LLM tokens to the client as they're generated instead of waiting for the complete response. It reduces perceived time-to-first-token from 3–10s to under 500ms, dramatically improving user experience for long-form outputs.
Tool Use
intermediateTool use (function calling) lets LLMs call external APIs, run code, and query databases by describing available functions and receiving structured JSON calls. It's the foundation of all modern LLM agents.
Evaluation
Measure, test, and prevent regressions in LLM applications.
A/B Model Testing
advancedA/B model testing runs two LLM configurations in parallel on real production traffic to measure which produces better outcomes. Unlike offline evals, A/B tests measure actual user behavior and business metrics — the ultimate signal for LLM quality.
Benchmark Selection
intermediateChoosing the right benchmarks determines whether your model evaluation is predictive of real-world performance. Learn which benchmarks matter in 2026, how to avoid benchmark gaming, and when to build domain-specific benchmarks instead.
Evals Framework
intermediateAn evals framework systematically measures LLM application quality across multiple dimensions, catches regressions, and provides actionable feedback. Learn how to structure eval pipelines, write eval functions, and integrate evals into CI/CD.
Golden Dataset
intermediateA golden dataset is a curated set of input/expected output pairs used as ground truth for evaluation. It's the foundation of every reliable LLM eval pipeline. Learn how to build, maintain, and expand golden datasets efficiently.
LLM-as-Judge
intermediateLLM-as-judge uses a language model to score other LLM outputs on quality dimensions like correctness, faithfulness, helpfulness, and safety. It scales evaluation to millions of examples where human labeling is impractical.
Regression Testing
intermediateRegression testing for LLMs catches quality degradations when you change prompts, upgrade models, or modify retrieval systems. Learn how to structure regression tests, set meaningful pass/fail thresholds, and integrate them into your CI/CD pipeline.
Cost & Latency
Cut costs 50-90% and reduce latency without sacrificing quality.
Batch Processing
intermediateBatch APIs (Anthropic, OpenAI) process large volumes of LLM requests asynchronously at 50% discount. Learn when to use batch vs. real-time, how to structure batch jobs, and how to handle failures in large batches.
Cost Optimization
intermediateLLM API costs can spiral from thousands to hundreds of thousands of dollars monthly as applications scale. This guide covers the complete toolkit: model routing, prompt caching, batching, compression, and output constraints — with realistic savings estimates for each.
Latency Optimization
intermediateLLM latency has two components: time-to-first-token (TTFT) and time-to-last-token (generation speed). Learn the techniques to reduce both — streaming, speculative decoding, smaller models, and caching — with concrete benchmarks.
Model Routing
advancedModel routing directs each query to the cheapest or fastest model capable of handling it. By routing simple queries to small models and complex ones to frontier models, most applications can cut costs 50–80% without quality loss.
Token Counting
beginnerTokens are the fundamental unit of LLM pricing and context limits. Understanding how to count tokens accurately lets you predict costs, manage context windows, and debug unexpected billing. Learn the practical differences between tokenizers and common token estimation rules.
Safety & Security
Prevent prompt injection, validate outputs, handle PII.
Guardrails
intermediateGuardrails wrap LLM calls with safety and policy checks on both input and output. They intercept harmful requests, enforce topical scope, detect jailbreaks, and ensure outputs comply with content policies and brand guidelines.
Output Validation
intermediateOutput validation checks LLM responses against schema, content, safety, and business logic rules before acting on them. It's the last line of defense against hallucinations, injection attacks, and model errors in production systems.
PII Handling
intermediateLLM applications that process user data must detect, redact, or handle PII (personally identifiable information) in compliance with GDPR, HIPAA, CCPA, and other regulations. Learn detection, redaction, pseudonymization, and architectural patterns for PII-safe LLM pipelines.
Prompt Injection Defense
intermediatePrompt injection attacks embed malicious instructions in user input, documents, or tool outputs to override system prompts and hijack LLM behavior. Learn detection patterns, defense-in-depth strategies, and why there's no complete solution.